The best product insight I had this year didn't come from a book, a conference, an AI bootcamp, or a Substack. It came from watching my Ranunculus go from seven petals last year to an extraordinary, layered showstopper this year — same bulbs, same bed, same care — and realizing I was looking at the most honest explanation of non-determinism I had ever seen.

I am currently fine-scaping my garden. It is a tiny plot of land around my house, which I am trying to use super efficiently, planting 24 varietals of Chrysanthemums, 7 types of Dahlias, 4 types of Ranunculus, and a growing collection of whatever else caught my eye along the way. The discipline it demands — what to plant, what to cut, what to give time — maps more precisely to building AI products than anything else I have encountered. Here is what the dirt taught me.

Open Beds: Building amidst Non-Determinism

Traditional software is like a controlled greenhouse. You design the beds, choose the seeds, set the hours of sun and moisture, and determine what grows where. An agentic product is more like an open meadow — fertile and prepared, but with no walls. Non-determinism changes everything about how you build, starting with inputs.

"Users can plant any intent into an agentic product: precise or vague, well-formed or completely unexpected." — Nargis Sakhibova

Input design is a challenge for AI products. Users can plant any intent into an agentic product: precise or vague, well-formed or completely unexpected. Your product will receive seeds it was never designed for, and it needs a response to all of them. The answer is not to wall the meadow back in. It is to handle what lands before it reaches the model. Start with the product's name and UX writing: both should clearly signal what the product does. Then classify intent upstream. Identify requests that fall outside your product's designed scope and route them appropriately. A user asking your enterprise agent to write a poem is not a prompt engineering problem. It is a routing problem. Solve it before it reaches the soil.
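The upstream-routing idea can be sketched in a few lines. This is a minimal, illustrative version: the keyword heuristic stands in for whatever real intent classifier a product would use, and the scope vocabulary and messages are invented for the example.

```python
# A minimal sketch of upstream intent routing. The keyword check below is a
# stand-in for a real intent classifier; IN_SCOPE and the redirect message
# are hypothetical examples for an imaginary infrastructure agent.

IN_SCOPE = {"deploy", "provision", "scale", "monitor"}  # example agent scope

def classify_intent(user_input: str) -> str:
    """Label a request 'in_scope' or 'out_of_scope' before it hits the model."""
    words = set(user_input.lower().split())
    return "in_scope" if words & IN_SCOPE else "out_of_scope"

def route(user_input: str) -> str:
    """Route out-of-scope requests away instead of prompt-engineering around them."""
    if classify_intent(user_input) == "out_of_scope":
        return "This assistant manages infrastructure. Try asking about deployments."
    return f"AGENT: handling '{user_input}'"  # a real product would call the model here
```

The point is structural, not the heuristic itself: the off-topic poem request gets a clear redirect before any model tokens are spent on it.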

The output problem is stranger. In my first year growing Ranunculus, my flowers produced seven petals each — technically a flower, but barely. This year, same bulbs, same bed, same care — a showstopper. Layered, countless petals, the kind of bloom that stops you mid-step. Nothing I did differently explained it. This is how non-determinism manifests. Not failure. Just the honest behavior of a living system responding to conditions you only partially control. The model is the Ranunculus. You can tend it carefully and still not know exactly what will bloom.

This is the central tension of every agentic product: we want the creativity and adaptability of a biological system and the reliability of a machine. You do not fully get both. Your job is to build honestly around that constraint — and to be straight with your users about what they are working with.

Blind Blooms: When Confidence Is Not a Signal

Blind blooms of Foxglove in dark privet color
Blind blooms look healthy and can photograph well, but are useless. Credit: Gemini restyling my blind bloom photos.

I bought a lot of Foxglove seeds one season, charmed by photographs of their tall, spired blooms, and inspired by success I have had with other Foxgloves. The plants came up strong. The foliage was a beautiful dark color, richer than any other Foxglove in my garden and several times the size of the others. I watched them for months, waiting. Nothing came. Not a single bloom. All that vigorous, confident growth, and no flower.

Growers call this a "blind bloom." The plant produces everything that looks like health and outputs nothing of value.

In the early days of agentic products, I asked one of the mainstream agents to spin up a virtual machine for fun. The response was confident and detailed — active instance, specs, status. The reality: there was no VM. A perfect theater of productivity with nothing behind the curtain.

If your agent cannot explain why it did what it did, you have not built a tool. You have built a prop. The job is to build systems with verifiable feedback loops — and to resist the very human temptation to be impressed by confident output before you have verified what is actually behind it.
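One way to build that feedback loop is to treat the agent's claim and the system of record as two separate sources, and report success only when they agree. The sketch below is a toy, assuming hypothetical `create_vm_via_agent` and `vm_exists` functions rather than any real cloud API; it exists to show the shape of the check, not an implementation.

```python
# A hedged sketch of a verifiable feedback loop: never surface an action as
# done based on the model's claim alone; check ground truth first.
# create_vm_via_agent and vm_exists are illustrative, not a real cloud API.

def create_vm_via_agent(name: str, registry: dict) -> str:
    """Simulate an agent that *claims* to create a VM but may produce no side effect."""
    return f"Created instance '{name}': 4 vCPU, 16 GB RAM, status RUNNING"

def vm_exists(name: str, registry: dict) -> bool:
    """Ground-truth check against the actual system of record."""
    return name in registry

def run_with_verification(name: str, registry: dict) -> str:
    """Only report the agent's claim to the user if the VM verifiably exists."""
    claim = create_vm_via_agent(name, registry)
    if not vm_exists(name, registry):
        return f"UNVERIFIED: agent claimed success but '{name}' was not found"
    return claim
```

With an empty registry, the confident claim gets caught before it reaches the user — the blind bloom is flagged instead of displayed.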

Pot Tests: When to Build Fast and When to Commit

Last year I planted 24 Chrysanthemum varietals directly into the ground and created total chaos. If I had grown a few in pots first, I would have seen how they behave, how much space they take, whether I even liked them next to each other — before committing them to the ground and disrupting everything around them.

The pot lets you see the thing in reality before reality becomes expensive. Vibe coding works the same way. Build the rough version, put it in front of a potential user, and find out whether the idea works before you have restructured your architecture around it.

But a pot is a tool for learning, not always a permanent home. I grew a Sam Hopkins Dahlia last year. It is a deep burgundy, almost black, extraordinary varietal when it performs. I planted it in a pot because I could not decide where in the garden it belonged. I knew I did not want it in a pot permanently, but as summer progressed and my garden filled in with other plants, I got lazier and ended up keeping it there. The stalks came up fragile. By the end of the season the plant made three flowers in total. Every other Dahlia in my garden, planted properly in the ground, flowered continuously.

Some plants genuinely thrive in pots. An internal tool, a one-off workflow, a narrow proof of concept — these can live in the prototype indefinitely and that is fine. But if you plan to scale the idea, the pot will eventually limit it. The container that protected the idea early on becomes the thing constraining its growth. Eventually you have to wait until winter, when the plant is dormant, to move it.

The signal that it is time to commit is when the prototype is no longer generating new information. When you are learning the same things over and over, you are not still testing. You are stalling.

Plant it. Deal with what grows.

Hagoromo: First Blooms Are Not the Final Answer

While researching Chrysanthemums I fell in love with a variety called the Hagoromo — a Japanese cultivar whose blooms, in photographs, were as large as a human head. Extraordinary, almost surreal. It is rare and not sold everywhere. I had to track down a specialized nursery in Oregon. Eventually, they were gracious enough to give me two cuttings.

I planted them and watched them struggle. I troubleshot systematically: sun, soil, fertilizer. I cut back the surrounding gladiolus to give them more room. What I got at the end of the season was a few minuscule blooms — remarkably rich in color, but nowhere near the showstoppers I had seen in photographs. Not yet.

Often the first result in an ambitious agentic product will look the same way — small and underwhelming. This is a tough moment, because the pressure to call it quits at this point is enormous. Stakeholders who approved the investment based on the demo are now looking at something that does not match what they said yes to. The temptation is to either oversell what you have or abandon the bet entirely.

Neither is the right move. The more useful question is: is this result underwhelming because the problem is wrong, or because the plant is still establishing its roots?

There are signals that tell you which situation you are in. If users who engage with the early version find it genuinely useful — even in limited ways — that is a root signal. If the feedback is consistently that the core interaction makes no sense, or if the segment of traffic most likely to benefit from the product shows no improvement regardless of quality, that is a wrong bet, not an early one.

Protect the work long enough for it to establish, but define in advance what signal would tell you the bet was wrong. Without that definition, you are not being patient. You are deferring a decision you were not ready to make.

The Hagoromo is still in the ground; I will see what this season brings. Unlike with Chrysanthemums, with agentic products you do not need to wait for another full season — models improve fast, and another iteration is days away, not months. But know what you are waiting to see before you wait.

Diverse Gardens: Resilience Requires Variety

My goal when I started fine-scaping was a garden that blooms from the first warmth of spring through Christmas (since I live in California and we don't have frost).

A bouquet of diverse cut flowers burgundy, orange, yellow in color, various shapes, in a vase.
An illustration of a real bouquet from my garden. Transformed by Gemini.

A monoculture cannot deliver that. Resilience over a full year requires diversity by design — different varieties blooming at different times, carrying the garden through seasons that would otherwise leave it empty.

A behaviorally homogeneous user base does the same thing to an agentic product that a monoculture does to a garden. It performs well in the conditions it was designed for and reveals nothing about what happens outside them.

The diversity that matters most is behavioral. You need users who ask simple questions and users who ask compound multi-step ones. Domain experts who push the boundaries of what your product knows, and novices who ask things experts would never think to ask. Each type stresses a different part of your system.

This is how you stress test your input layer. A behaviorally diverse user base tells you whether your routing holds, whether your intent classification covers the cases you designed for and the ones you did not, and where the soil is too shallow to support real growth. Resilience is not built in the architecture alone. It is earned through the range of people you let into the garden early.
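That stress test can be made concrete with a small harness that runs prompts from each behavioral segment through the router and reports how each one lands. Everything here is invented for illustration — the persona names, the prompts, and the toy one-word classifier standing in for a real intent model.

```python
# A sketch of stress-testing the input layer with behaviorally diverse
# prompts. Personas and prompts are made up; router() is a toy stand-in
# for whatever trained intent classifier the product actually uses.

PERSONA_PROMPTS = {
    "novice":    ["how do i start?", "what does this thing do"],
    "expert":    ["chain a rollback into the canary deploy if p99 regresses"],
    "compound":  ["provision a VM, install deps, then schedule nightly backups"],
    "off_topic": ["write me a poem about clouds"],
}

def router(prompt: str) -> str:
    """Toy intent classifier: a real product would use a trained model here."""
    return "out_of_scope" if "poem" in prompt else "in_scope"

def coverage_report(prompts_by_persona: dict) -> dict:
    """Show how each persona's prompts get routed; gaps reveal shallow soil."""
    return {
        persona: [router(p) for p in prompts]
        for persona, prompts in prompts_by_persona.items()
    }
```

Running the report across all four personas is the code equivalent of letting a diverse set of people into the garden early: each segment exercises a different branch of the routing, and an empty or misrouted column tells you where the design has not been tested.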

Key Framework

User diversity is a launch-blocking criterion: monocultures will cost you in production. Behavioral diversity tells you whether your routing holds, whether your intent classification is robust enough for production, and where your soil is too shallow to support real growth.


Daffodils: Why Your Evals Are Measuring the Wrong Thing

My name, Nargis, means daffodil. When I started fine-scaping, planting daffodils felt personal. I bought a large bag, planted them carefully, and then began the slow ritual of watching. Leaves pushing through mulch, first buds slowly appearing with the spring.

When my daffodils finally opened, they were half the size I expected. Still fragrant. Still beautiful. But half the size. I started troubleshooting: water, soil, fertilizer, sun exposure. I moved some to a different part of the garden, added fertilizer. I kept searching for the variable I had gotten wrong.

Two daffodil flowers held in a hand, bright yellow in color, roughly an inch in size each.
Restyled image of real dwarf daffodils from my garden. Created with Gemini.

Eventually I realized there was no variable to fix. They were dwarf daffodils. The benchmark I had been measuring against — the full, generous bloom I had been picturing for months — was never what I had planted. My expectation had been built on the wrong reference point from the beginning.

This is what broken evals look like. Your raters arrive with a mental image of what a good output should be — shaped by their perception of what your product does. When the actual output arrives, they measure it against that image, find it wanting, and flag it as a failure. If you took those evals at face value, you would start changing prompts, adjusting models, tuning parameters — wasting time fixing a gap that is not a quality problem. It is a calibration problem.

There is a second layer that makes this harder. Raters are afraid of missing something. An ambiguous output is safer to mark as an error than to let through, so that is what raters do. The result is an eval pipeline systematically biased toward rejection, measuring against what good means to them, when the actual question is far more specific: is this a good response for this user in this context? Those are completely different questions.

Nobody gets evals right on the first attempt. Raters need to be calibrated — not just trained on a rubric, but trained on a grounded, realistic definition of what good looks like for this product, these users, this use case. The goal is not an idealized response measured against the best demo you ever saw. It is a genuinely useful response for the person who actually sent that input.
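One simple way to detect the rejection bias described above is to compare rater labels against a small golden set that has already been graded against the grounded definition of good. This is a minimal sketch with invented labels; real calibration would use a larger golden set and proper agreement metrics, but the measured quantity is the same.

```python
# A minimal sketch of rater calibration against a golden set. The labels
# are invented examples; the point is to measure systematic rejection bias
# (raters marking acceptable outputs as failures) before touching prompts.

def rejection_bias(rater_labels: list, golden_labels: list) -> float:
    """Fraction of golden-'good' outputs the rater rejected.

    A high value suggests miscalibrated expectations, not model failure.
    """
    good_total = sum(1 for g in golden_labels if g == "good")
    if good_total == 0:
        return 0.0
    wrongly_rejected = sum(
        1 for r, g in zip(rater_labels, golden_labels)
        if g == "good" and r == "bad"
    )
    return wrongly_rejected / good_total
```

If a rater rejects two of the three outputs the golden set calls good, the bias is 2/3 — a signal to recalibrate the rater's reference point, not to retune the model.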

The daffodils were not wrong. The picture in my head was. — Nargis Sakhibova

Building agentic products is not an engineering problem with a gardening metaphor bolted on. Time spent in my garden and behind my laptop taught me that it is closer to cultivation than construction. You are working with living systems — models that respond to conditions you only partially control, users who bring intentions you never designed for, evals that require benchmarks you have to earn through real usage rather than define in advance. The difference between a product that lasts and one that peaks and goes bare is whether you knew which one you were growing.

∑ · ✦ · π
NS

Nargis Sakhibova

Product Manager at Google. Economist. Writer. Five languages, one calendar. Previously Adobe, Analysis Group and international development organizations.