When machines draft copy and code on the first try, the scoreboard we trusted stops making sense.
Rethinking metrics in the age of AI starts with admitting that the old numbers were built for a world where people did all the work and tools only amplified fingers on keyboards. Now the tool is a teammate: it proposes, critiques, and negotiates with your input. That means velocity charts, lines of code, and clickthrough rates tell a thinner story than we need. They capture output but not the path to get there, they reward motion even when it hides rework, and they miss the moment where a small prompt tweak saves a week of engineering or a creative brief sparks ten testable ideas. The core move is to measure the conversation and the decision, not just the artifact. That is why the winning teams I see are tracking time to confidence on a task, the percentage of work touched by AI with a quality bar, and feedback loops that show the model learned something real from each round.
On the engineering side, a lot of us leaned on burn-down charts and code throughput. But when a copilot writes a scaffold in seconds, the question shifts to what you kept, what you rejected, and why. The key signals become review depth per change, defect escape rate across releases, and incident minutes to clarity when something goes wrong, because those tell us whether AI made the system safer or just faster at creating a mess. So we add prompt and context quality checks to pull requests, record assist ratio in the editor to see how much of a file was suggested versus authored, and watch first-pass acceptance on generated tests as a proxy for how well the model understood the spec. We compare time to first correct run before and after AI help to see real gains that do not depend on word count or token volume. Then we wrap this with simple gates: no silent retries in agents, strict input logging without personal data, and a change review checklist that asks when and where the model influenced a decision.
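Assist ratio can be sketched as a simple character count over editor change events. The event shape below is an assumption for illustration; real editors and plugins expose this telemetry in their own formats.

```python
def assist_ratio(changes: list[dict]) -> float:
    """Share of inserted characters that came from AI suggestions.

    `changes` is a list of editor events, each an illustrative dict like
    {"chars": 120, "source": "ai"} or {"chars": 40, "source": "human"};
    treat the shape as an assumption, not a standard schema.
    """
    ai = sum(c["chars"] for c in changes if c["source"] == "ai")
    total = sum(c["chars"] for c in changes)
    return ai / total if total else 0.0

edits = [
    {"chars": 300, "source": "ai"},     # accepted completion
    {"chars": 100, "source": "human"},  # hand-written glue code
]
print(round(assist_ratio(edits), 2))  # 0.75
```

Counting characters rather than lines keeps the ratio honest when the model emits dense one-liners or sprawling boilerplate.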
On the marketing side, content machines can spin up a sea of assets, but reach without relevance is a sugar high. A more honest scorecard mixes creative throughput with an on-brand rate judged by humans, adds lift over control for each asset family, and keeps a consent ledger that ties every run to allowed data and an approved voice. Then we track search share of intent around category terms, the assist rate of AI in campaign planning, and brief-to-publish time across teams to see if the machines are unblocking ideas or just flooding inboxes. We bring back holdouts and switchback tests to separate model hype from real incremental sales, and we treat attribution as a team sport where the model, the channel, and the human all get partial credit, since a chat reply that pushes someone to a demo can matter more than a last click that squeaks in at midnight. So I like to log path contribution that scores touches by sequence and content, plus attention quality using scroll depth, replies, and saves, not just impressions and vanity likes.
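Lift over control is the workhorse here, and it is one line of arithmetic. A hedged sketch, using made-up conversion rates; in practice you would pair this with a significance test before acting on small gaps.

```python
def lift_over_control(treated_rate: float, control_rate: float) -> float:
    """Relative lift of a treated group over its holdout control.

    Both inputs are conversion rates in [0, 1]. This deliberately ignores
    significance testing; run a proper test before trusting small lifts.
    """
    if control_rate == 0:
        raise ValueError("control rate is zero; relative lift is undefined")
    return (treated_rate - control_rate) / control_rate

# Illustrative numbers: 5% conversion with the AI asset, 4% in the holdout.
print(f"{lift_over_control(0.05, 0.04):.0%}")  # 25%
```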
This is where attribution gets honest, because AI now persuades across the whole stack, from a suggestion in a help doc to a reply in support to the headline in a retargeting ad. The best way I have found to give credit is to frame a few simple questions and run cheap experiments: what happens when the model is off for a slice of traffic, what changes when you swap the system prompt, how does the path change if you remove one agent from the flow. Then measure incrementality with geo split tests, counterfactual paths built from prior logs, and sequence weights that look at order, not just presence. Keep a plain rule in place that the bigger the claim, the stronger the test, because nobody wants a dashboard that pays for its own fairy tale. And if your brand runs on trust, you need to be able to say which part of the machine helped and where a human took the wheel.
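Sequence weighting can be as simple as a time-decay scheme. The sketch below is one hedged scheme of many, with a `decay` knob you should fit or argue about, not a law; touch names are illustrative.

```python
def sequence_weights(touches: list[str], decay: float = 0.7) -> dict[str, float]:
    """Credit touches by order, not just presence.

    Later touches get more credit: the last touch has raw weight 1, the
    one before it `decay`, the one before that `decay**2`, and so on.
    Weights are then normalized so each path's credit sums to 1.
    """
    raw = [decay ** (len(touches) - 1 - i) for i in range(len(touches))]
    total = sum(raw)
    credit: dict[str, float] = {}
    for touch, weight in zip(touches, raw):
        credit[touch] = credit.get(touch, 0.0) + weight / total
    return credit

path = ["help_doc_suggestion", "support_reply", "retargeting_ad"]
for touch, share in sequence_weights(path).items():
    print(f"{touch}: {share:.2f}")
```

Swapping the decay for learned weights from counterfactual paths is the natural next step once the logs exist.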
Behind all of this sits data health, because AI measurement fails fast when prompts, context, and output are lost in chat history. I push teams to keep a metric registry with plain names and owners, a prompt registry with versions and a last review date, model cards that include limits, guardrails, and training notes, a consent map that tags every feature with what data it can touch, and a simple red team log that stores the worst cases we tested. Then we make metrics observable the way we do services: dashboards for drift in quality scores, alerts on feedback volume swings, and a weekly ritual where we retire one metric that no longer matters. A small set of living numbers beats a graveyard of charts that nobody trusts.
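A prompt registry does not need a platform; a versioned record with an owner and a review date covers most of it. A minimal sketch, with field names and the 90-day review window as assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PromptRecord:
    """One versioned prompt in a team registry (field names are illustrative)."""
    name: str
    version: int
    text: str
    owner: str
    last_reviewed: date

class PromptRegistry:
    def __init__(self) -> None:
        self._records: dict[str, list[PromptRecord]] = {}

    def register(self, record: PromptRecord) -> None:
        self._records.setdefault(record.name, []).append(record)

    def latest(self, name: str) -> PromptRecord:
        return max(self._records[name], key=lambda r: r.version)

    def stale(self, today: date, max_age_days: int = 90) -> list[str]:
        """Prompts whose latest version has gone too long without review."""
        return [
            name for name, _ in self._records.items()
            if (today - self.latest(name).last_reviewed).days > max_age_days
        ]

reg = PromptRegistry()
reg.register(PromptRecord("onboarding_summary", 1, "Summarize the doc.", "dana", date(2024, 1, 10)))
reg.register(PromptRecord("onboarding_summary", 2, "Summarize and cite sources.", "dana", date(2024, 6, 2)))
print(reg.latest("onboarding_summary").version)   # 2
print(reg.stale(today=date(2024, 12, 1)))         # ['onboarding_summary']
```

The `stale` list is what feeds the weekly ritual: every prompt either earns a fresh review date or gets retired.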
Ownership needs a rewrite too. When a model drafts the first version and a human edits it into something we can ship, who owns success and who owns failure? The most durable answer I have found is to move away from solo hero numbers and toward shared outcomes. Set team goals that cut across functions, like time to first value for a new user or cost per qualified lead with a quality floor. Add shared SLOs where engineering, data, and marketing all watch the same gauges. Reward the moments where someone pulled work out of the system rather than stuffing more into it: fewer pages of code that do more, fewer assets with more reuse, fewer steps with a clearer path. And make it safe to say no to vanity work, because the scoreboard now pays for flow, not flurry.
If you want practical metrics that travel well between dev and growth teams, start with a small spine you can explain on a whiteboard. Track time to clarity on a ticket or brief, since clarity beats speed when the model is guessing. Track assist coverage, the share of work touched by AI with a quality label from one to five. Track review depth by counting comments per change and edits per asset before publish. Track bug and issue escape rate across releases and campaigns, not to punish but to see if we shipped trust. Track learning velocity as the number of prompts or briefs retired each week because we found a cleaner path. On top of that, keep one money line like cost per retained user or profit per order so the craft metrics do not drift, and link them in your doc so folks see the chain from model choice to business outcome.
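The spine above fits in one small record. A sketch, where the field names and units are assumptions rather than a standard, shown with one week's made-up numbers:

```python
from dataclasses import dataclass

@dataclass
class MetricSpine:
    """One week's snapshot of the five spine metrics plus the money line."""
    time_to_clarity_hours: float   # median, per ticket or brief
    assist_coverage: float         # share of work touched by AI, 0..1
    review_depth: float            # comments per change, edits per asset
    escape_rate: float             # issues found after release / all issues
    learning_velocity: int         # prompts or briefs retired this week
    cost_per_retained_user: float  # the money line that keeps craft honest

def escape_rate(caught_in_review: int, escaped_to_users: int) -> float:
    """Share of defects that slipped past review and reached users."""
    total = caught_in_review + escaped_to_users
    return escaped_to_users / total if total else 0.0

week = MetricSpine(
    time_to_clarity_hours=5.5,
    assist_coverage=0.4,
    review_depth=3.2,
    escape_rate=escape_rate(caught_in_review=18, escaped_to_users=2),
    learning_velocity=3,
    cost_per_retained_user=4.10,
)
print(week.escape_rate)  # 0.1
```

Keeping all six fields in one record makes the chain from model choice to business outcome a join, not a slide.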
Privacy sits at the center of this shift, because the new telemetry includes prompts, drafts, and user replies that can leak more than you want. The measurement plan must include retention windows for model logs, PII redaction at collection rather than later, and a way to prove consent on every run. If you use retrieval against your own content, measure source coverage, freshness, and hallucination rate by sampling outputs and asking whether the citations actually back the claim. Then report the data footprint of your AI features the same way you report uptime and latency for your site. Trust is turning into a buying factor in search, in partner reviews, and in legal reviews for bigger deals, and the teams who can show short retention, clear consent, and fast purge will move faster than the ones who treat this like fine print.
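Redaction at collection means the raw value never lands on disk, so retention windows and purge jobs have less to clean up. A minimal sketch with two illustrative patterns; a real redactor needs a much broader, tested set and should fail closed.

```python
import re

# Two illustrative patterns only; production redaction needs more coverage.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    """Strip obvious PII before a prompt or reply is ever written to logs."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Reach me at jo@example.com or 555-867-5309."))
# Reach me at [EMAIL] or [PHONE].
```

Call it in the collector, not in a later batch job: a nightly scrub still leaves hours where the raw prompt sat in a log somebody could read.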
Tooling should make the invisible visible without turning every day into a spreadsheet. I like one board that shows AI effort and one board that shows AI results: input tokens, assist ratio, and review depth on the left; lift, first-pass acceptance, and defect escape on the right. To keep us honest, we always add counter metrics that prevent gaming. If we push for more AI coverage, we also watch user complaints and edit time. If we chase speed, we also watch returns and refunds. If we chase reach, we also watch active subscribers after thirty days. We teach Goodhart's law on day one so nobody builds a plan that would break the product if taken to the extreme. Then we hold weekly readouts that are short, show three wins and three fixes, and end with one metric we stop tracking because it no longer earns its keep.
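The counter-metric pairing can even be enforced in code: flag any headline metric that improved while its guard got worse. The pairs, metric names, and numbers below are illustrative assumptions, not a standard.

```python
# Each headline metric is paired with the counter metric that keeps it honest.
COUNTER_PAIRS = {
    "ai_coverage": "user_complaints",
    "cycle_speed": "refund_rate",
    "reach": "active_subscribers_30d",
}

def gaming_alerts(snapshot: dict[str, float], prior: dict[str, float]) -> list[str]:
    """Flag headline metrics that rose while their counter metric got worse."""
    alerts = []
    for headline, counter in COUNTER_PAIRS.items():
        headline_up = snapshot[headline] > prior[headline]
        # For complaints and refunds, "worse" means up; for subscribers, down.
        counter_worse = (
            snapshot[counter] < prior[counter]
            if counter == "active_subscribers_30d"
            else snapshot[counter] > prior[counter]
        )
        if headline_up and counter_worse:
            alerts.append(f"{headline} rose while {counter} got worse")
    return alerts

prior = {"ai_coverage": 0.30, "user_complaints": 12, "cycle_speed": 1.0,
         "refund_rate": 0.02, "reach": 50_000, "active_subscribers_30d": 4_000}
now = {"ai_coverage": 0.45, "user_complaints": 30, "cycle_speed": 1.1,
       "refund_rate": 0.02, "reach": 61_000, "active_subscribers_30d": 3_800}
print(gaming_alerts(now, prior))
```

A check like this is a good opener for the weekly readout: any alert it raises is a candidate for the "three fixes" list.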
Getting started is not about buying another platform; it is about a short plan you can ship this quarter. Pick a product squad and a growth pod. Write down five metrics from this page that you can measure with the data you already have, and hook them to a shared doc with owners and targets. Run two clean A/B tests that ask specific questions about model help. Turn on a prompt registry and a small consent map. Hold a one-hour review each week where you look at the numbers, review five samples of output by hand, and agree on one change to try next. At day thirty, retire one metric. At day sixty, ship a public note on how you measure AI in your product or marketing. At day ninety, move this pattern to another team and do it again, because the learning is the product and the scoreboard now runs like a living system, not a trophy case.
Change what you count, or your AI will count you out.