Why Enterprise AI Drifts: 10 Forces Behind Silent Decay

§ 01

The old contract: software only changed when you changed it

For three decades the rule was simple. Deterministic software produced the same output for the same input, every time. It changed under exactly two conditions: you shipped a new release, or an upstream API you consumed changed its contract. Regression testing was therefore an event. You ran the suite at a release gate, or when a dependency published a breaking change. Between those events, a system that passed on Monday passed on Friday. Stability was the default. Change was the exception you scheduled.

Anyone who has run an IT department over that period built their assurance model on this assumption. Change control, release gates, dependency tracking: all of it assumes the system holds still unless someone moves it.

§ 02

The new contract: the system changes while you stand still

AI systems break that assumption. A model is probabilistic, not deterministic — the same prompt can return different outputs. But probabilism is the least of it. The harder fact is this: an AI system can degrade with zero changes on your side. Nobody shipped a release. No API contract broke. And yet last quarter's answers are no longer the answers you get today.

Regression stops being an event and becomes a condition. The question is no longer "did our change break something" but "is the system still behaving, today, the way it did when we signed it off."

§ 03

What actually moves underneath you

There are ten distinct forces. Only one of them, factor 9 (tool and integration drift), maps cleanly to the old regression trigger. The other nine are new, and most of them fire without any change-control event you can hang a test run on.

#	Factor	What it is	Closest old-world analogy	Why it drifts
1	Probabilistic output	The same input can produce different outputs	None — deterministic code never did this	The model is statistical; sampling introduces variation by design
2	Silent vendor model updates	The commercial model behind your agent is retrained or replaced by the provider	A third-party library force-upgraded in production without a version bump you control	You call an API, not a binary you pinned; the provider improves the model and your tested behaviour shifts
3	Context-layer drift	The documents, retrieval sources, and system prompts the model reads change	Reference/config tables another team edits under your batch job	Knowledge bases update continuously; retrieval returns different passages for the same query
4	Input distribution shift	The questions and data customers send change shape over time	Seasonal load — but here the shape changes, not the volume	Customers learn what the system does well and ask differently; the business mix moves
5	Data / world drift	The facts the model reasons over change — prices, products, policies	Stale reference data	The world the model was tested against is not today's world
6	Concept drift	The definition of a correct answer changes — new regulation, new product rule	A changed business rule nobody propagated to all systems	Ground truth itself moved; the model is still right by yesterday's standard
7	Feedback loops	Users adapt to the system, and its outputs re-enter as future inputs	None in deterministic IT	Human behaviour co-evolves with the tool
8	Agentic compounding	In multi-step agents, small per-step deviations multiply	Rounding error accumulating across a long batch chain	Each step's output is the next step's input; error compounds along the chain
9	Tool / integration drift	Downstream tools and APIs the agent calls change	The classic consumed-API change — this one is familiar	Same as before, except an autonomous agent hits it without a human noticing the contract changed
10	Internal prompt / config changes	Your own teams edit prompts, retrieval settings, model parameters	Undocumented config change in production	Prompts are soft, ungoverned, and trivially easy to change
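
Factor 8 is easier to feel with numbers. The sketch below assumes each step of an agent succeeds independently with the same probability, which is a simplification, and the 98% figure is purely illustrative rather than a measurement. Under that assumption, end-to-end success is the per-step rate raised to the number of steps.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps.

    Independence is an assumption made for illustration; real agent steps
    are often correlated, so treat the result as a rough guide only.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical agent whose individual steps are right 98% of the time.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success_rate(0.98, steps), 3))
```

A step that looks safe in isolation yields a twenty-step chain that completes cleanly only about two times in three, which is why end-to-end success has to be measured on the whole chain, not inferred from per-step tests.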

The teaching point for anyone from a traditional IT background: your existing regression discipline catches factor 9 and nothing else. The other nine require a different operating model.

The test for any deployed AI system: ask it a representative business question this week, and again next month, with no deliberate changes in between. If the answer moves and you cannot explain why, you do not have a monitoring gap — you have an unmeasured production system.

§ 04

What this changes operationally

AI reliability is not a build milestone. It is an operating cost. You do not test once and ship. You monitor continuously, evaluate on a schedule, and treat the hosted model as a supply-chain dependency you do not control. A deployment without an evaluation budget is not a deployment — it is a demo that happens to be in production.

§ 05

Best practices and the operating model

Detect

Golden dataset (regression bed). A versioned set of representative inputs with approved outputs — the single source of truth for "is it still behaving." Maintained, not frozen: it grows from real traffic, every incident, and new use cases that emerge.
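
A golden set can start as something very small. The following is a minimal sketch, not a prescribed schema: the field names, the `source` values, the sample cases, and the normalized exact-match comparison are all assumptions chosen for illustration. Real systems usually need a more tolerant comparison, such as rubric or model-based grading.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    approved_output: str
    source: str  # hypothetical tag: "traffic", "incident", or "new_use_case"


def normalized_match(actual: str, approved: str) -> bool:
    """Case- and whitespace-insensitive exact match (a deliberately strict stand-in)."""
    return " ".join(actual.lower().split()) == " ".join(approved.lower().split())


def pass_rate(
    cases: Iterable[GoldenCase],
    answer_fn: Callable[[str], str],
    matches: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Share of golden cases where the system's answer matches the approved one."""
    cases = list(cases)
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if matches(answer_fn(c.prompt), c.approved_output))
    return passed / len(cases)


# Illustrative golden set; in practice this lives in version control.
GOLDEN_V1 = [
    GoldenCase("refund-001", "What is the refund window?", "30 days", "traffic"),
    GoldenCase("ship-002", "Do you ship to Norway?", "Yes", "incident"),
]


def stub_system(prompt: str) -> str:
    """Stand-in for the deployed AI system; replace with a real call."""
    answers = {"What is the refund window?": "30 Days", "Do you ship to Norway?": "No"}
    return answers.get(prompt, "")


print(pass_rate(GOLDEN_V1, stub_system))
```

Tagging each case with where it came from makes it possible to see whether the set is still growing from real traffic and incidents, or has quietly frozen.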

Input monitoring. Track the distribution of incoming requests and alert when it diverges from the golden set. This tells you when your evaluation has gone stale and is no longer testing what users actually send.
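
One common way to put a number on this divergence is the Population Stability Index (PSI). The sketch below assumes each request has already been bucketed into a category, for example by a hypothetical intent classifier, and compares the category mix of the golden set with that of recent live traffic. The epsilon floor for empty buckets and the sample labels are illustrative choices.

```python
import math
from collections import Counter
from typing import Sequence


def psi(expected: Sequence[str], actual: Sequence[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of category labels.

    PSI = sum over categories of (a - e) * ln(a / e), where e and a are the
    category shares in the expected and actual samples. Shares are floored at
    eps so a category missing from one side does not cause a division by zero.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(e_counts) | set(a_counts):
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


# Illustrative labels: the golden set is mostly billing, live traffic is mostly shipping.
golden_labels = ["billing"] * 8 + ["shipping"] * 2
live_labels = ["billing"] * 2 + ["shipping"] * 8
print(round(psi(golden_labels, live_labels), 3))
```

Whatever alert threshold is applied to the score should be agreed and written down in advance; the right value depends on how the traffic is bucketed and how much movement is normal for the business.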

Output quality monitoring. Automated scoring on live traffic, plus human review on a sample.

Evaluate

Scheduled evaluation, not release-gated. Run the golden set nightly or weekly against the live system — not only when you deploy.
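
The scheduled run only earns its keep if each result is compared against the pass rate recorded at sign-off. A minimal sketch of that comparison follows, assuming the scheduler itself (cron, a workflow engine, or similar) lives elsewhere. The `EvalRun` record, the 0.95 baseline, and the 0.02 tolerance are hypothetical values.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class EvalRun:
    run_date: date
    pass_rate: float  # golden-set pass rate measured on that date


def regressions(history: Sequence[EvalRun], baseline: float, tolerance: float = 0.02) -> List[EvalRun]:
    """Return the runs whose pass rate fell more than `tolerance` below the baseline."""
    return [run for run in history if baseline - run.pass_rate > tolerance]


# Illustrative history: nothing was deployed between these runs.
history = [
    EvalRun(date(2024, 1, 1), 0.95),
    EvalRun(date(2024, 1, 8), 0.94),
    EvalRun(date(2024, 1, 15), 0.90),
]

for run in regressions(history, baseline=0.95):
    print(f"{run.run_date}: pass rate {run.pass_rate:.2f} is below the signed-off baseline")
```

Comparing against the sign-off baseline rather than the previous run matters: a slow decline of one point a week never trips a run-over-run check.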

Canary / shadow before promotion. Any model, prompt, or retrieval change runs in shadow against live traffic before it is promoted. Nothing reaches production on assertion alone.
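
In shadow mode the candidate sees the same requests as the current system, but only the current system's answers are served. The sketch below shows that shape under simplifying assumptions: both systems are plain functions from request to answer, and disagreement is judged by a trimmed string comparison. Both are stand-ins for whatever the real stack uses.

```python
from typing import Callable, List, Sequence, Tuple


def shadow_compare(
    requests: Sequence[str],
    current: Callable[[str], str],
    candidate: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str, str]], float]:
    """Serve `current`, run `candidate` silently, and record where they disagree.

    Returns (served answers, disagreements, disagreement rate).
    """
    served: List[str] = []
    disagreements: List[Tuple[str, str, str]] = []
    for request in requests:
        live_answer = current(request)
        shadow_answer = candidate(request)  # never shown to the user
        served.append(live_answer)
        if live_answer.strip() != shadow_answer.strip():
            disagreements.append((request, live_answer, shadow_answer))
    rate = len(disagreements) / len(requests) if requests else 0.0
    return served, disagreements, rate


# Hypothetical systems: the candidate changes one answer out of four.
def current_system(request: str) -> str:
    return f"answer to {request}"


def candidate_system(request: str) -> str:
    return "changed" if request == "q3" else f"answer to {request}"


served, diffs, rate = shadow_compare(["q1", "q2", "q3", "q4"], current_system, candidate_system)
print(rate, diffs)
```

The disagreement list is the useful output: each entry is a concrete case for a human to judge before promotion, and a candidate for the golden set afterwards.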

Govern

Vendor change management. Pin model versions where the provider allows. Subscribe to deprecation notices. Re-run the golden set before and after any forced provider update. Keep a rollback path.

Prompt and config governance. Prompts are code. Version them, review them, forbid silent edits in production.
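
One lightweight way to make silent edits detectable is to fingerprint the prompt together with its configuration and refuse to run anything whose fingerprint has not been reviewed. The sketch below is an illustration only: the approved-set mechanism, the 12-character truncation, and the sample prompt and settings are assumptions, and a real setup would more likely rely on version control plus a deployment check.

```python
import hashlib
import json
from typing import Any, Dict, Set


def fingerprint(prompt: str, config: Dict[str, Any]) -> str:
    """Short, stable identifier for a prompt plus its settings.

    sort_keys=True makes the result independent of dictionary key order.
    """
    payload = json.dumps({"prompt": prompt, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical reviewed prompt and settings.
REVIEWED_PROMPT = "You are a support assistant."
REVIEWED_CONFIG = {"temperature": 0.2, "top_k": 5}
APPROVED: Set[str] = {fingerprint(REVIEWED_PROMPT, REVIEWED_CONFIG)}


def is_approved(prompt: str, config: Dict[str, Any], approved: Set[str] = APPROVED) -> bool:
    """True only if this exact prompt and config combination was reviewed."""
    return fingerprint(prompt, config) in approved


print(is_approved(REVIEWED_PROMPT, REVIEWED_CONFIG))
print(is_approved(REVIEWED_PROMPT + " Be brief.", REVIEWED_CONFIG))
```

The same fingerprint can double as the prompt version recorded with every call, so a behaviour change can be traced to the edit that caused it.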

Observability. Log input, retrieved context, model version, prompt version, output, and confidence for every call. You cannot diagnose drift you never recorded.
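
The fields listed above map naturally onto one structured record per call. The following is a minimal sketch that writes each record as a JSON line; the field names, the example values, and the choice of JSON lines are assumptions, and `confidence` is optional because not every model exposes one.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CallRecord:
    timestamp: str
    user_input: str
    retrieved_context: List[str]
    model_version: str
    prompt_version: str
    output: str
    confidence: Optional[float] = None


def utc_now() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def to_log_line(record: CallRecord) -> str:
    """Serialize one call as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Illustrative record; the version strings are hypothetical.
record = CallRecord(
    timestamp=utc_now(),
    user_input="What is the refund window?",
    retrieved_context=["Policy doc v7, section 3: refunds within 30 days."],
    model_version="example-model-2024-01-01",
    prompt_version="a1b2c3d4e5f6",
    output="30 days",
    confidence=0.91,
)
print(to_log_line(record))
```

Recording the retrieved context alongside the model and prompt versions is what lets an investigator tell apart a vendor change, a knowledge-base change, and a prompt edit after the fact.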

Named ownership. One function (AI reliability / LLMOps) owns the eval suite, the thresholds, and the response runbook. Drift with no owner is drift nobody fixes.

Respond

Human-in-the-loop gates where the cost of a wrong answer is high.

A drift runbook: defined thresholds, defined escalation, defined rollback — decided before the incident, not during it.
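
Deciding before the incident means the thresholds exist as data, not as a debate. A minimal sketch of a runbook expressed that way follows; the two-level escalate and rollback structure and the 0.92 and 0.85 cut-offs are hypothetical and would be set per system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftRunbook:
    escalate_below: float  # pass rate under this triggers escalation to the owner
    rollback_below: float  # pass rate under this triggers rollback

    def __post_init__(self) -> None:
        if self.rollback_below > self.escalate_below:
            raise ValueError("rollback threshold must not exceed escalation threshold")

    def action_for(self, pass_rate: float) -> str:
        """Map a golden-set pass rate to the pre-agreed response."""
        if pass_rate < self.rollback_below:
            return "rollback"
        if pass_rate < self.escalate_below:
            return "escalate"
        return "continue"


# Hypothetical thresholds agreed at sign-off.
runbook = DriftRunbook(escalate_below=0.92, rollback_below=0.85)
for observed in (0.95, 0.90, 0.80):
    print(observed, runbook.action_for(observed))
```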

§ 06

KPIs for the dashboard

KPI	What it tells you	Healthy direction
Golden-set pass rate (trend)	Core regression signal	Flat or rising
Input drift score (e.g. PSI / divergence vs golden set)	Whether your evaluation still represents reality	Low and stable
Live output quality score (automated + sampled human)	Real-world behaviour, not lab behaviour	Stable or rising
Hallucination / factual-error rate	Trustworthiness	Falling
Task / agent success rate (end-to-end)	Whether agents complete the job	Stable or rising
Time-to-detect drift	Observability maturity	Falling
Time-to-remediate	Operational maturity	Falling
Vendor-version events + post-change pass-rate delta	Supply-chain risk from the model provider	Tracked; no silent regressions
Human override / escalation rate	Where the system is weak	Stable or explained
Eval coverage (share of live traffic patterns in the golden set)	Blind-spot risk	Rising
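
Most of these KPIs are simple ratios once the underlying events are logged. As one example, eval coverage can be approximated as the share of live requests whose category also appears in the golden set. The sketch below assumes requests are already labelled by category (a hypothetical intent label) and weights coverage by traffic volume, which is one reasonable definition among several.

```python
from collections import Counter
from typing import Iterable, Sequence


def eval_coverage(live_labels: Sequence[str], golden_labels: Iterable[str]) -> float:
    """Share of live requests whose category is represented in the golden set."""
    if not live_labels:
        raise ValueError("no live traffic to measure against")
    covered = set(golden_labels)
    counts = Counter(live_labels)
    covered_requests = sum(n for label, n in counts.items() if label in covered)
    return covered_requests / len(live_labels)


# Illustrative traffic: warranty questions are not in the golden set yet.
live = ["billing"] * 6 + ["shipping"] * 3 + ["warranty"] * 1
golden = ["billing", "shipping"]
print(eval_coverage(live, golden))
```

The uncovered categories are the actionable part: each one is a traffic pattern the evaluation cannot currently vouch for.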

§ 07

What a leader should take away

1. Reframe the budget. AI systems are not capital projects that finish. The build is a fraction of lifetime cost; continuous evaluation and monitoring is opex you fund from day one.

2. Treat the commercial model as an uncontrolled supplier. You would never run production on a third-party component that silently rewrites itself. A hosted frontier model is exactly that. Govern it like a supplier — version pinning, change notices, pre/post evaluation, rollback.

3. The golden dataset is a strategic asset, not test scaffolding. It is the only thing that lets you state, with evidence, that the system still does what you bought it for. It compounds: every incident makes it sharper. Underinvesting here is the most common and most expensive mistake in enterprise AI.

4. Make "same question, same answer" a board-level reliability metric. If your analytics tool, your agent, and your dashboard disagree, that is a measurable reliability defect — not a quirk to tolerate.

5. Decide your drift posture before you scale, not after. Retrofitting consistency onto a fleet of agents that have drifted in different directions costs far more than instrumenting one well. The cheapest time to instrument is before the second deployment.
