§ 01
The old contract: software only changed when you changed it
For three decades the rule was simple. Deterministic software produced the same output for the same input, every time. It changed under exactly two conditions: you shipped a new release, or an upstream API you consumed changed its contract. Regression testing was therefore an event. You ran the suite at a release gate, or when a dependency published a breaking change. Between those events, a system that passed on Monday passed on Friday. Stability was the default. Change was the exception you scheduled.
Anyone who has run an IT department in the last twenty years built their assurance model on this assumption. Change control, release gates, dependency tracking: all of it assumes the system holds still unless someone moves it.
§ 02
The new contract: the system changes while you stand still
AI systems break that assumption. A model is probabilistic, not deterministic: the same prompt can return different outputs. But nondeterminism is the least of it. The harder fact is this: an AI system can degrade with zero changes on your side. Nobody shipped a release. No API contract broke. And yet the answers you got last quarter are no longer the answers you get today.
Regression stops being an event and becomes a condition. The question is no longer "did our change break something" but "is the system still behaving, today, the way it did when we signed it off."
§ 03
What actually moves underneath you
There are ten distinct forces. Only one of them, factor 9, maps cleanly to the old regression trigger. The other nine are new, and most of them fire without any change-control event you can hang a test run on.
| # | Factor | What it is | Closest old-world analogy | Why it drifts |
|---|---|---|---|---|
| 1 | Probabilistic output | The same input can produce different outputs | None — deterministic code never did this | The model is statistical; sampling introduces variation by design |
| 2 | Silent vendor model updates | The commercial model behind your agent is retrained or replaced by the provider | A third-party library force-upgraded in production without a version bump you control | You call an API, not a binary you pinned; the provider improves the model and your tested behaviour shifts |
| 3 | Context-layer drift | The documents, retrieval sources, and system prompts the model reads change | Reference/config tables another team edits under your batch job | Knowledge bases update continuously; retrieval returns different passages for the same query |
| 4 | Input distribution shift | The questions and data customers send change shape over time | Seasonal load — but here the shape changes, not the volume | Customers learn what the system does well and ask differently; the business mix moves |
| 5 | Data / world drift | The facts the model reasons over change — prices, products, policies | Stale reference data | The world the model was tested against is not today's world |
| 6 | Concept drift | The definition of a correct answer changes — new regulation, new product rule | A changed business rule nobody propagated to all systems | Ground truth itself moved; the model is still right by yesterday's standard |
| 7 | Feedback loops | Users adapt to the system, and its outputs re-enter as future inputs | None in deterministic IT | Human behaviour co-evolves with the tool |
| 8 | Agentic compounding | In multi-step agents, small per-step deviations multiply | Rounding error accumulating across a long batch chain | Each step's output is the next step's input; error compounds along the chain |
| 9 | Tool / integration drift | Downstream tools and APIs the agent calls change | The classic consumed-API change — this one is familiar | Same as before, except an autonomous agent hits it without a human noticing the contract changed |
| 10 | Internal prompt / config changes | Your own teams edit prompts, retrieval settings, model parameters | Undocumented config change in production | Prompts are soft, ungoverned, and trivially easy to change |
The teaching point for anyone from a traditional IT background: your existing regression discipline catches factor 9 and nothing else. The other nine require a different operating model.
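Factor 8 is the easiest to underestimate, so a little arithmetic helps. The sketch below assumes every step in an agent chain succeeds independently and with the same probability, which real agents will not satisfy exactly. The function name `chain_success_rate` and the 98% figure are illustrative assumptions, not measurements.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable (a simplification).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return per_step_success ** steps


# Illustrative only: a step that is right 98% of the time, chained.
for steps in (1, 5, 10, 20):
    rate = chain_success_rate(0.98, steps)
    print(f"{steps:>2} steps: {rate:.1%} end-to-end")
```

Under those assumptions, a step that is right 98% of the time gives a chain that is right roughly 82% of the time at ten steps and roughly 67% at twenty.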
The test for any deployed AI system: ask it a representative business question this week, and again next month, with no deliberate changes in between. If the answer moves and you cannot explain why, you do not have a monitoring gap — you have an unmeasured production system.
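One minimal way to run that test is to replay a fixed set of questions and compare the pass rate with the rate recorded at sign-off. Everything named in the sketch below is hypothetical: `GoldenCase`, `golden_set_pass_rate`, `has_regressed`, the keyword check, and the 0.02 tolerance are assumptions for illustration, and `ask` stands in for whatever function calls your deployed system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_keywords: tuple[str, ...]  # facts a correct answer must mention


def passes(answer: str, case: GoldenCase) -> bool:
    """Crude check: every expected keyword appears in the answer."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in case.expected_keywords)


def golden_set_pass_rate(ask: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Share of golden cases the system answers acceptably right now."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(passes(ask(case.question), case) for case in cases)
    return passed / len(cases)


def has_regressed(baseline_rate: float, current_rate: float, tolerance: float = 0.02) -> bool:
    """True when the pass rate fell by more than the (assumed) tolerance."""
    return (baseline_rate - current_rate) > tolerance


# Stub systems standing in for "at sign-off" and "today"; no real model is called.
cases = [
    GoldenCase("What is the refund window?", ("30 days",)),
    GoldenCase("Which plan includes SSO?", ("enterprise",)),
]
at_sign_off = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "SSO is included in the Enterprise plan.",
}
today = dict(at_sign_off, **{"Which plan includes SSO?": "SSO is included in every plan."})

baseline = golden_set_pass_rate(lambda q: at_sign_off[q], cases)
current = golden_set_pass_rate(lambda q: today[q], cases)
print(baseline, current, has_regressed(baseline, current))
```

Keyword matching is deliberately simple here. A production check would more likely use a graded rubric or sampled human review, but the shape is the same: fixed questions, a recorded baseline, and a rerun with nothing deliberately changed.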
§ 04
What this changes operationally
AI reliability is not a build milestone. It is an operating cost. You do not test once and ship. You monitor continuously, evaluate on a schedule, and treat the hosted model as a supply-chain dependency you do not control. A deployment without an evaluation budget is not a deployment — it is a demo that happens to be in production.
§ 05
Best practices and the operating model
Detect
Evaluate
Govern
Respond
§ 06
KPIs for the dashboard
| KPI | What it tells you | Healthy direction |
|---|---|---|
| Golden-set pass rate (trend) | Core regression signal | Flat or rising |
| Input drift score (e.g. Population Stability Index (PSI) or another divergence measure, live traffic vs golden set) | Whether your evaluation still represents reality | Low and stable |
| Live output quality score (automated + sampled human) | Real-world behaviour, not lab behaviour | Stable or rising |
| Hallucination / factual-error rate | Trustworthiness | Falling |
| Task / agent success rate (end-to-end) | Whether agents complete the job | Stable or rising |
| Time-to-detect drift | Observability maturity | Falling |
| Time-to-remediate | Operational maturity | Falling |
| Vendor-version events + post-change pass-rate delta | Supply-chain risk from the model provider | Tracked; no silent regressions |
| Human override / escalation rate | Where the system is weak | Stable or explained |
| Eval coverage (share of live traffic patterns represented in the golden set) | Blind-spot risk | Rising |
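
The input drift score in the table can be computed in several ways. The sketch below uses the Population Stability Index over question categories, PSI = Σ (actual − expected) · ln(actual / expected), where "expected" is the category mix in the golden set and "actual" is the mix in live traffic. The category labels, the `eps` floor for empty categories, and the function name are assumptions, and a real system would first need some way to assign each question a category.

```python
import math
from collections import Counter


def population_stability_index(expected: list[str], actual: list[str], eps: float = 1e-4) -> float:
    """PSI between two lists of category labels.

    `eps` floors each proportion so a category missing from one side
    does not produce a division by zero or log of zero.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)
    score = 0.0
    for category in set(expected_counts) | set(actual_counts):
        e = max(expected_counts[category] / len(expected), eps)
        a = max(actual_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


# Illustrative category mixes, not real traffic.
golden_set = ["billing"] * 50 + ["shipping"] * 30 + ["returns"] * 20
live_traffic = ["billing"] * 20 + ["shipping"] * 30 + ["returns"] * 50

print(round(population_stability_index(golden_set, golden_set), 3))
print(round(population_stability_index(golden_set, live_traffic), 3))
```

An identical mix scores zero, and the score grows as the mixes diverge. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any threshold should be calibrated against your own traffic.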
§ 07
What a leader should take away
1. Reframe the budget. AI systems are not capital projects that finish. The build is a fraction of lifetime cost; continuous evaluation and monitoring are opex you fund from day one.
2. Treat the commercial model as an uncontrolled supplier. You would never run production on a third-party component that silently rewrites itself. A hosted frontier model is exactly that. Govern it like a supplier — version pinning, change notices, pre/post evaluation, rollback.
3. The golden dataset is a strategic asset, not test scaffolding. It is the only thing that lets you state, with evidence, that the system still does what you bought it for. It compounds: every incident makes it sharper. Underinvesting here is the most common and most expensive mistake in enterprise AI.
4. Make "same question, same answer" a board-level reliability metric. If your analytics tool, your agent, and your dashboard disagree, that is a measurable reliability defect — not a quirk to tolerate.
5. Decide your drift posture before you scale, not after. Retrofitting consistency onto a fleet of agents that have drifted in different directions costs far more than instrumenting one well. The cheapest time to instrument is before the second deployment.
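The "same question, same answer" metric in takeaway 4 can be measured crudely. The sketch below scores a set of answers to one question by the share that agree with the most common answer. It assumes exact match after whitespace and case normalisation, which undercounts agreement between answers that are phrased differently but mean the same thing. The function names and the sample figures are hypothetical.

```python
from collections import Counter
from typing import Sequence


def normalise(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def agreement_rate(answers: Sequence[str]) -> float:
    """Share of answers that match the most common (modal) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalise(answer) for answer in answers)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(answers)


# Illustrative: one business question put to three hypothetical sources.
answers_by_source = {
    "analytics tool": "1,204",
    "agent": " 1,204",
    "dashboard": "1,198",
}
print(f"{agreement_rate(list(answers_by_source.values())):.0%} agreement")
```

The same function works for one system asked the same question several times, which gives a direct reading on factor 1 as well as on cross-system disagreement.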