What to measure when AI is running your product

Jun 14

Last week Josh and I hosted one of our Sense & Respond Learning webinars with Ben Yoskovitz. We go back with Ben to the early 2010s, when he and Alistair Croll wrote Lean Analytics in the same series that published our Lean UX book. He has started five companies, run product inside a few more, and today he runs Highline Beta, a venture studio that builds businesses with big companies. He spends his days at the zero-to-one edge, validating problems and shipping things, with Claude Code running in the background while he talks. If anyone has earned the right to talk about how AI changes what we measure, it's Ben.

Before I handed him the floor, I shared a little cartoon I'd drawn in Mural. You've heard me say it a hundred times: outcomes over outputs. We measure what people do, the change in their behavior, as the proxy for whether we delivered anything of real worth. That idea has held up for a long time. What struck me preparing for this session is how much AI quietly scrambles the inputs to that equation, and how few teams have updated their scorecard to match.

Why AI breaks the product metrics we've always trusted

I opened with four questions for Ben to answer.

The first is about how we build. We used to measure engineering efficiency with crude proxies, lines of code being the most infamous. Drop an AI coding agent into the team and the question gets slippery. Are we measuring whether the team is efficient, or whether the bot is? The second is about how people use our systems. For decades those systems were deterministic, with clear expectations about what goes in and what comes out. Put AI in the middle and the same input from three different people can produce three different answers. What does success even look like when the output is non-deterministic and unique to whoever typed the prompt? The third is about who the user is. We've spent years telling teams to know their customer, and that worked beautifully while the customer was a human. Increasingly the thing using your product is another piece of software. And the fourth is about money. The old model was predictable: humans show up, take actions, transact, you make margin. Throw agents into the mix as users and the revenue math gets a lot harder to forecast.

Those are my questions. Ben spent the hour turning them into something you can act on, and he organized it as a set of shifts in what we measure on the product side and on the business side.

The product metrics that change the moment you add AI

Ben's first move was to reframe engagement. Stop staring at whether a number is going up or down and start understanding what people are actually spending their time on inside the AI features. He thinks in loops, and pointed back to Nir Eyal's Hooked model as still useful, just more complicated now. The user invokes some AI capability, gets an output, and then there's a logical next step you want them to take. If they move to that step quickly, they probably got what they needed. If they're stuck in the AI interaction, retrying, rephrasing, never advancing, you have a problem hiding in plain sight. In many ways it's the old funnel, with a quality dimension layered on top.
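One way to make the loop concrete is to log each AI interaction and summarize how often people advance versus stall. The sketch below is only an illustration, not something Ben showed: the `Interaction` fields, the `STUCK_ATTEMPTS` threshold of three, and the `loop_summary` name are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional

# Assumption: three or more attempts without advancing counts as "stuck".
STUCK_ATTEMPTS = 3


@dataclass
class Interaction:
    user_id: str
    attempts: int                        # prompts sent, including retries and rephrasings
    advanced: bool                       # did the user take the intended next step?
    seconds_to_advance: Optional[float]  # None when the user never advanced


def loop_summary(interactions):
    """Summarize how often an AI interaction leads to the next step."""
    total = len(interactions)
    if total == 0:
        return {"advance_rate": 0.0, "stuck_rate": 0.0,
                "median_seconds_to_advance": None}
    # Time to advance only exists for interactions that actually advanced.
    times = [i.seconds_to_advance for i in interactions
             if i.advanced and i.seconds_to_advance is not None]
    stuck = sum(1 for i in interactions
                if not i.advanced and i.attempts >= STUCK_ATTEMPTS)
    return {
        "advance_rate": sum(1 for i in interactions if i.advanced) / total,
        "stuck_rate": stuck / total,
        "median_seconds_to_advance": median(times) if times else None,
    }
```

A rising stuck rate alongside flat or rising usage is the "problem hiding in plain sight" case: the engagement chart looks healthy while people are retrying rather than moving on.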

The harder shift is quality itself, because quality is now distributed and your users don't all perceive it the same way. Ben told a story about an HR person who prompted an onboarding guide into existence, got it back as HTML, needed a PDF, and so took screenshots of the HTML pages and stitched them into a document. Ask that person how the quality was and they'll shrug and say it was fine. A less sophisticated user often can't tell they got a worse result. That's why Ben pushes teams to measure quality across cohorts, separating new users from power users, the tech-savvy from the not. He also flagged customer support as a leading indicator in a new way. It always told you about churn risk. Now it tells you something earlier, because the questions arriving are less "how do I do this" and more "wait, is my data safe, can I even trust this thing," often based on fears that aren't founded but shape adoption anyway.
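A rough sketch of what measuring quality across cohorts could look like follows. It assumes you have two scores per interaction, both on a 0 to 1 scale: what the user said about the result and what an independent evaluation said. That pairing, the cohort labels, and the `quality_by_cohort` name are my assumptions for illustration, not part of Ben's talk.

```python
from collections import defaultdict


def quality_by_cohort(records):
    """Average self-reported and evaluated quality per cohort.

    records: iterable of (cohort, self_reported, evaluated) tuples,
    with both scores assumed to be on a 0..1 scale.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # self total, eval total, count
    for cohort, self_reported, evaluated in records:
        bucket = sums[cohort]
        bucket[0] += self_reported
        bucket[1] += evaluated
        bucket[2] += 1
    report = {}
    for cohort, (self_total, eval_total, count) in sums.items():
        self_avg = self_total / count
        eval_avg = eval_total / count
        report[cohort] = {
            "self_reported": self_avg,
            "evaluated": eval_avg,
            # A positive gap means users rate the output higher than the eval does.
            "gap": self_avg - eval_avg,
        }
    return report
```

A large positive gap in one cohort is the screenshot-the-HTML situation in numbers: people say it was fine while the evaluated result says otherwise.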

Then there's the shift I keep coming back to, which is agents using your product. Ben's e-commerce example landed hard. A human shopper searches, clicks a product, adds to cart, gets nudged toward a recommendation, and your analytics are tuned for exactly that journey. Send an agent instead and it scrapes the entire catalog in five seconds, ignores your funnel completely, maybe spins up five browser windows, maybe just buys the thing on your behalf. If you have an API or an MCP endpoint, this is already happening whether you've instrumented it or not. His advice is to start baselining how agents move through your software now, because they behave nothing like the humans your dashboards were built around. The connective tissue under all of this is an eval harness. If you can't systematically evaluate the output your AI produces, you have no reliable way of knowing whether you're delivering anything at all.
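For readers who haven't built one, an eval harness can start very small. The sketch below is a bare-bones version under stated assumptions: `generate` stands in for whatever function calls your AI feature, and each case is a hypothetical `(name, prompt, check)` triple where `check` returns True when the output is acceptable. Real harnesses add graded scoring, datasets, and regression tracking.

```python
def run_evals(generate, cases):
    """Run each case through `generate` and report the pass rate.

    generate: callable taking a prompt string and returning an output string
              (a placeholder for your own AI feature).
    cases:    list of (name, prompt, check) where check(output) -> bool.
    """
    results = []
    for name, prompt, check in cases:
        try:
            passed = bool(check(generate(prompt)))
        except Exception:
            # A crash in generation or checking counts as a failed case.
            passed = False
        results.append((name, passed))
    pass_rate = sum(1 for _, ok in results if ok) / len(results) if results else 0.0
    return pass_rate, results
```

Running the same cases on every model or prompt change gives you a baseline to compare against, which is the minimum needed to say whether output quality moved.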

Why your power users might be your most expensive customers

The business shifts are where I saw a few people in the chat go quiet. Pre-AI, we loved power users without reservation. They used the product constantly, rarely churned, invited their colleagues, and cost us almost nothing at the margin. That last part was the whole promise of software. Every additional user was nearly free.

Tokens broke that. The compute you burn to generate value for a customer costs real money, which means you now carry a variable cost of goods sold that scales with usage. Ben's own war story made it concrete. He ran an experiment using a new multi-agent workflow feature, let it run in the background, and got a note from his co-founder asking where all the money went. The thing had spun up a small army of agents that chewed through four million tokens arguing with each other. He was fine, his house is safe, but the point stuck. Your most engaged users can quietly become your least profitable ones. So Ben wants three numbers most teams have never tracked: gross margin per active user, cost per successful task, and model cost as a percentage of revenue. Cost per successful task is the one I'd start with, because it forces you to tie spend to something a customer actually accomplished rather than raw consumption.
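Here is a minimal sketch of those three numbers, assuming you can pull revenue, model spend, other cost of goods sold, active users, and successful task counts for the same period. The function and parameter names are hypothetical, and I've assumed that all model spend, including failed and retried tasks, is charged against the successful ones.

```python
def unit_economics(revenue, model_cost, other_cogs, active_users, successful_tasks):
    """Compute the three AI-era business metrics for one period.

    All money values are in the same currency and cover the same period.
    """
    if revenue <= 0 or active_users <= 0 or successful_tasks <= 0:
        raise ValueError("revenue, active_users and successful_tasks must be positive")
    gross_margin = revenue - model_cost - other_cogs
    return {
        "gross_margin_per_active_user": gross_margin / active_users,
        # Assumption: failed and retried tasks still cost tokens, so total
        # model spend is divided by successful tasks only.
        "cost_per_successful_task": model_cost / successful_tasks,
        "model_cost_pct_of_revenue": 100 * model_cost / revenue,
    }
```

Run the same calculation per user rather than in aggregate and the expensive power users show up quickly: heavy usage with a thin or negative gross margin.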

This is also why pricing can no longer sit only with sales. The people who understand where the costs live are product and engineering, and that knowledge has to shape the model. I suspect we'll see real pressure on flat monthly fees as customers realize they can vibe-code a cheaper alternative for the simple stuff, paired with growing comfort with usage-based and outcome-based pricing for the work that genuinely matters. Ben's own synthetic research tool is a good illustration. He knows each interview it runs costs him about a dollar fifty-five, which is the kind of unit-level fact you have to know before you can price anything sanely.
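To show why the unit cost matters, here is the basic arithmetic for turning a cost per unit into a price floor at a chosen gross margin. Only the dollar fifty-five comes from Ben. The target margin is a made-up input for illustration, and `price_for_target_margin` is a hypothetical helper, not how Ben prices his tool.

```python
def price_for_target_margin(unit_cost, target_gross_margin):
    """Lowest price per unit that achieves the target gross margin.

    Gross margin = (price - cost) / price, so price = cost / (1 - margin).
    target_gross_margin is a fraction, e.g. 0.75 for 75%.
    """
    if not 0 <= target_gross_margin < 1:
        raise ValueError("target_gross_margin must be in [0, 1)")
    return unit_cost / (1 - target_gross_margin)
```

With a cost of 1.55 per interview and an illustrative 75% target margin, `price_for_target_margin(1.55, 0.75)` comes to about 6.20 per interview. Without the unit cost, you have no way to know whether a flat monthly fee clears that floor.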

Shipping features got easy, so experimentation got serious

Here's the part I'd underline twice. Experimentation used to feel like vanity to a lot of teams. How many experiments did we run, who cares. The enterprise version of that crime was bragging about how many features you shipped. Ben's argument is that because building and shipping is now almost free, the rigor has to go up, not down. When you can "vibe-stuff" anything into the product over a weekend, the discipline of deciding what to build, forming a real pre-launch hypothesis, and estimating what a feature will cost to run becomes the actual job. Ease of production raises the bar on judgment rather than lowering it.

He closed his Q&A on something genuinely mind-bending. When agents start operating your software through APIs, the UI can disappear entirely. Why log in and click buttons when you can tell your agent to go do the whole thing and email you the result? I'll admit I don't love where that points, and Ben doesn't either. But the experience still exists even when the interface doesn't. UX doesn't die in that world. It just stops being something you can see on a screen, which is going to force a lot of us to rethink what we even mean by the word.

What to actually do with this

If you build AI into your product, the takeaway is to rebuild your scorecard around three questions. How does the user get value, how do you know they got it, and what did it cost you to give it to them. Practically, pick one AI feature this week and instrument the loop around it: time to a useful output, whether the user advances to the next step, and the token cost of that single successful task. That one measurement will tell you more than another month of watching your top-line usage chart bounce around.
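As a starting point, the record for a single task might look like the sketch below. Everything named here is an assumption for illustration: the `TaskRecord` fields, the per-million-token pricing inputs, and the idea that your provider bills input and output tokens at separate rates. Substitute your own provider's actual prices.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    seconds_to_useful_output: float  # from first prompt to an output the user kept
    advanced_to_next_step: bool      # did the user move on to the intended next step?
    input_tokens: int
    output_tokens: int


def token_cost_usd(record, input_price_per_million, output_price_per_million):
    """Token cost of one task, given hypothetical per-million-token prices."""
    return (record.input_tokens * input_price_per_million
            + record.output_tokens * output_price_per_million) / 1_000_000


def scorecard_row(record, input_price_per_million, output_price_per_million):
    """The three measurements for one task, ready to log or aggregate."""
    return {
        "seconds_to_useful_output": record.seconds_to_useful_output,
        "advanced_to_next_step": record.advanced_to_next_step,
        "token_cost_usd": token_cost_usd(
            record, input_price_per_million, output_price_per_million),
    }
```

Logging one row like this per task for a single feature is enough to start answering all three scorecard questions for that feature.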

I came into this convinced that outcomes over outputs still holds. I left even more sure that it does, and a little humbled by how much harder it just got to measure. Worth coming back to this one in a few months to see how much has already changed again.

Jeff Gothelf
