Enterprise AI, decoded

October 2, 2025 · Opinion

Why Enterprise AI Drifts — and Why Your Regression Playbook No Longer Protects You

For three decades the rule held: deterministic software gave the same output for the same input, and changed only when you shipped a release or an upstream API broke. Regression testing was an event you scheduled. AI breaks that contract. A model can degrade with zero changes on your side — nobody shipped anything, no API broke, yet last quarter's answers are not the answers you get today. Regression stops being an event and becomes a condition. Ten distinct forces move the system underneath you — probabilistic output, silent vendor model updates, context and input shift, concept drift, agentic compounding — and only one of them, a changed downstream API, is caught by the regression discipline IT leaders spent careers building. The reframe a leader has to make: AI reliability is opex funded from day one, not a build milestone; the hosted model is an uncontrolled supplier to be version-pinned and watched; and the golden evaluation set is a strategic asset, because it is the only thing that proves, with evidence, that the system still does what you bought it for.

9 min · AI Infrastructure & Operations · Governance Risk & Trust · Traceability & Explainability

§ 01

The old contract: software only changed when you changed it


For three decades the rule was simple. Deterministic software produced the same output for the same input, every time. It changed under exactly two conditions: you shipped a new release, or an upstream API you consumed changed its contract. Regression testing was therefore an event. You ran the suite at a release gate, or when a dependency published a breaking change. Between those events, a system that passed on Monday passed on Friday. Stability was the default. Change was the exception you scheduled.

Everyone who has run an IT department for the last twenty years built their assurance model on this. Change control, release gates, dependency tracking — all of it assumes the system holds still unless someone moves it.

§ 02

The new contract: the system changes while you stand still


AI systems break that assumption. A model is probabilistic, not deterministic — the same prompt can return different outputs. But probabilism is the least of it. The harder fact is this: an AI system can degrade with zero changes on your side. Nobody shipped a release. No API contract broke. And yet last quarter's answers are no longer the answers you get today.

Regression stops being an event and becomes a condition. The question is no longer "did our change break something" but "is the system still behaving, today, the way it did when we signed it off."

§ 03

What actually moves underneath you


There are ten distinct forces. Only one of them, tool and integration drift (factor 9), maps cleanly to the old regression trigger. The other nine are new, and most of them fire without any change-control event you can hang a test run on.

| # | Factor | What it is | Closest old-world analogy | Why it drifts |
|---|--------|------------|---------------------------|---------------|
| 1 | Probabilistic output | The same input can produce different outputs | None: deterministic code never did this | The model is statistical; sampling introduces variation by design |
| 2 | Silent vendor model updates | The commercial model behind your agent is retrained or replaced by the provider | A third-party library force-upgraded in production without a version bump you control | You call an API, not a binary you pinned; the provider improves the model and your tested behaviour shifts |
| 3 | Context-layer drift | The documents, retrieval sources, and system prompts the model reads change | Reference/config tables another team edits under your batch job | Knowledge bases update continuously; retrieval returns different passages for the same query |
| 4 | Input distribution shift | The questions and data customers send change shape over time | Seasonal load, but here the shape changes, not the volume | Customers learn what the system does well and ask differently; the business mix moves |
| 5 | Data / world drift | The facts the model reasons over change: prices, products, policies | Stale reference data | The world the model was tested against is not today's world |
| 6 | Concept drift | The definition of a correct answer changes: new regulation, new product rule | A changed business rule nobody propagated to all systems | Ground truth itself moved; the model is still right by yesterday's standard |
| 7 | Feedback loops | Users adapt to the system, and its outputs re-enter as future inputs | None in deterministic IT | Human behaviour co-evolves with the tool |
| 8 | Agentic compounding | In multi-step agents, small per-step deviations multiply | Rounding error accumulating across a long batch chain | Each step's output is the next step's input; error compounds along the chain |
| 9 | Tool / integration drift | Downstream tools and APIs the agent calls change | The classic consumed-API change; this one is familiar | Same as before, except an autonomous agent hits it without a human noticing the contract changed |
| 10 | Internal prompt / config changes | Your own teams edit prompts, retrieval settings, model parameters | Undocumented config change in production | Prompts are soft, ungoverned, and trivially easy to change |
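
Agentic compounding (factor 8) is easy to underestimate, so a little arithmetic may help. The sketch below assumes every step succeeds independently with the same probability. That is a simplifying assumption introduced here for illustration, not a claim from measured agents, whose steps are usually correlated. The function name is hypothetical.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """End-to-end success probability of a chain of independent steps.

    Assumption (illustrative): each step succeeds independently with the
    same probability, so the chain succeeds with probability p ** n.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        rate = chain_success_rate(0.98, n)
        print(f"{n:>2} steps at 98% per step: {rate:.1%} end to end")
```

Under these assumptions, a step that is right 98% of the time yields a 20-step chain that finishes correctly only about two times in three. A per-step deviation too small to notice in isolation becomes visible at the end of the chain.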

The teaching point for anyone from a traditional IT background: your existing regression discipline catches factor 9 and nothing else. The other nine require a different operating model.

The test for any deployed AI system: ask it a representative business question this week, and again next month, with no deliberate changes in between. If the answer moves and you cannot explain why, you do not have a monitoring gap — you have an unmeasured production system.
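
The repeat-question test above can be turned into a number. The following is a minimal sketch, assuming the answers are short and factual (a figure, a category, a yes/no), so that exact match after light normalisation is a fair comparison. Free-text answers would need a semantic scorer instead. The function names are hypothetical, and collecting the answers from the deployed system is left out.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting does not count as drift."""
    return " ".join(answer.lower().split())


def answer_consistency(answers: list[str]) -> float:
    """Share of answers that agree with the most common (normalised) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(answers)


if __name__ == "__main__":
    # Answers to the same business question, collected on different dates.
    collected = ["42 units", "42  Units", "40 units"]
    print(f"consistency: {answer_consistency(collected):.2f}")
```

A score of 1.0 means every run agreed. Anything lower is a prompt to ask which of the ten factors moved.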

§ 04

What this changes operationally


AI reliability is not a build milestone. It is an operating cost. You do not test once and ship. You monitor continuously, evaluate on a schedule, and treat the hosted model as a supply-chain dependency you do not control. A deployment without an evaluation budget is not a deployment — it is a demo that happens to be in production.

§ 05

Best practices and the operating model


Detect

  • Golden dataset (regression bed). A versioned set of representative inputs with approved outputs — the single source of truth for "is it still behaving." Maintained, not frozen: it grows from real traffic, every incident, and new use cases that emerge.
  • Input monitoring. Track the distribution of incoming requests and alert when it diverges from the golden set. This tells you when your evaluation has gone stale and is no longer testing what users actually send.
  • Output quality monitoring. Automated scoring on live traffic, plus human review on a sample.
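
One way to put a number on input monitoring is the Population Stability Index (PSI), which compares the category mix of live requests with the mix in the golden set. The sketch below assumes each request has already been labelled with a category upstream (the intent labels are hypothetical) and uses a small floor value to avoid dividing by zero when a category is missing on one side.

```python
import math
from collections import Counter


def psi(reference: list[str], live: list[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two lists of category labels.

    PSI = sum over categories of (live_share - ref_share) * ln(live_share / ref_share).
    Shares are floored at `eps` so that unseen categories do not cause log(0).
    """
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref_counts, live_counts = Counter(reference), Counter(live)
    score = 0.0
    for category in set(ref_counts) | set(live_counts):
        ref_share = max(ref_counts[category] / len(reference), eps)
        live_share = max(live_counts[category] / len(live), eps)
        score += (live_share - ref_share) * math.log(live_share / ref_share)
    return score


if __name__ == "__main__":
    golden = ["billing"] * 50 + ["returns"] * 50   # hypothetical intent labels
    this_week = ["billing"] * 80 + ["returns"] * 20
    print(f"PSI = {psi(golden, this_week):.3f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift. Those cut-offs are conventions, not guarantees, and the right alert threshold depends on the use case.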
Evaluate

  • Scheduled evaluation, not release-gated. Run the golden set nightly or weekly against the live system — not only when you deploy.
  • Canary / shadow before promotion. Any model, prompt, or retrieval change runs in shadow against live traffic before it is promoted. Nothing reaches production on assertion alone.
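
A scheduled golden-set run can be small. The sketch below assumes the live system is reachable through a plain callable that takes a prompt and returns text, and that an approved output can be checked by exact match. Both are simplifications, and `GoldenCase`, `run_golden_set`, and the stand-in system are hypothetical names. Scheduling itself (cron, a workflow engine) is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    approved: str  # the signed-off output for this prompt


def exact_match(got: str, want: str) -> bool:
    return got.strip() == want.strip()


def run_golden_set(
    cases: list[GoldenCase],
    system: Callable[[str], str],
    matches: Callable[[str, str], bool] = exact_match,
) -> tuple[float, list[str]]:
    """Run every golden case against the system; return (pass rate, failed case ids)."""
    if not cases:
        raise ValueError("golden set is empty")
    failed = [c.case_id for c in cases if not matches(system(c.prompt), c.approved)]
    return 1 - len(failed) / len(cases), failed


if __name__ == "__main__":
    cases = [
        GoldenCase("refund-01", "Refund window?", "30 days"),
        GoldenCase("ship-01", "Standard shipping time?", "3-5 business days"),
    ]
    # Stand-in for the deployed system, used only to make the sketch runnable.
    canned = {"Refund window?": "30 days", "Standard shipping time?": "2 days"}
    rate, failed = run_golden_set(cases, lambda prompt: canned[prompt])
    print(f"pass rate {rate:.0%}, failed: {failed}")
```

The same function serves the shadow case: run it once against the current configuration and once against the candidate, then compare the two pass rates before promoting anything.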
Govern

  • Vendor change management. Pin model versions where the provider allows. Subscribe to deprecation notices. Re-run the golden set before and after any forced provider update. Keep a rollback path.
  • Prompt and config governance. Prompts are code. Version them, review them, forbid silent edits in production.
  • Observability. Log input, retrieved context, model version, prompt version, output, and confidence for every call. You cannot diagnose drift you never recorded.
  • Named ownership. One function (AI reliability / LLMOps) owns the eval suite, the thresholds, and the response runbook. Drift with no owner is drift nobody fixes.
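
The observability and prompt-governance points above can share one record format. This is a minimal sketch, assuming one JSON line per call and a prompt version derived from a hash of the prompt text, so that any silent edit shows up as a new version. The field names, the model identifier, and the hashing choice are illustrative assumptions rather than a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def prompt_version(prompt_text: str) -> str:
    """Short content hash: any edit to the prompt yields a different version id."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class CallRecord:
    timestamp: str
    model_version: str            # the pinned model identifier, as reported by the vendor
    prompt_version: str
    user_input: str
    retrieved_context: list[str]
    output: str
    confidence: Optional[float]   # None when the system exposes no confidence value


def to_log_line(record: CallRecord) -> str:
    """Serialise one call as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    system_prompt = "You answer questions about the returns policy."
    record = CallRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_version="example-model-2025-01-01",  # hypothetical identifier
        prompt_version=prompt_version(system_prompt),
        user_input="Can I return an opened item?",
        retrieved_context=["Returns policy, section 2"],
        output="Yes, within 30 days.",
        confidence=None,
    )
    print(to_log_line(record))
```

With model version and prompt version on every line, a drop in quality can be lined up against the moment either one changed.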
Respond

  • Human-in-the-loop gates where the cost of a wrong answer is high.
  • A drift runbook: defined thresholds, defined escalation, defined rollback — decided before the incident, not during it.
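
Deciding thresholds before the incident can be as simple as writing them down as data. The sketch below is one possible shape, assuming two signals (golden-set pass rate and an input drift score) and three outcomes. The class names and every threshold value are placeholders to be set by the owning team, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    ESCALATE = "escalate"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class DriftRunbook:
    baseline_pass_rate: float        # pass rate at sign-off
    escalate_drop: float = 0.02      # placeholder: drop that pages the owner
    rollback_drop: float = 0.05      # placeholder: drop that triggers rollback
    max_input_drift: float = 0.25    # placeholder: drift score that pages the owner

    def decide(self, pass_rate: float, input_drift: float) -> Action:
        """Map current signals to the action agreed in advance."""
        drop = self.baseline_pass_rate - pass_rate
        if drop >= self.rollback_drop:
            return Action.ROLLBACK
        if drop >= self.escalate_drop or input_drift > self.max_input_drift:
            return Action.ESCALATE
        return Action.NONE


if __name__ == "__main__":
    runbook = DriftRunbook(baseline_pass_rate=0.95)
    print(runbook.decide(pass_rate=0.92, input_drift=0.05))
```

Keeping the thresholds in a versioned file next to the prompts means a change to the runbook goes through the same review as a change to the system.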
§ 06

KPIs for the dashboard


| KPI | What it tells you | Healthy direction |
|-----|-------------------|-------------------|
| Golden-set pass rate (trend) | Core regression signal | Flat or rising |
| Input drift score (e.g. PSI / divergence vs golden set) | Whether your evaluation still represents reality | Low and stable |
| Live output quality score (automated + sampled human) | Real-world behaviour, not lab behaviour | Stable or rising |
| Hallucination / factual-error rate | Trustworthiness | Falling |
| Task / agent success rate (end-to-end) | Whether agents complete the job | Stable or rising |
| Time-to-detect drift | Observability maturity | Falling |
| Time-to-remediate | Operational maturity | Falling |
| Vendor-version events + post-change pass-rate delta | Supply-chain risk from the model provider | Tracked; no silent regressions |
| Human override / escalation rate | Where the system is weak | Stable or explained |
| Eval coverage (share of live traffic patterns in the golden set) | Blind-spot risk | Rising |
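
Eval coverage, the last KPI above, is simple to compute once requests carry a category label. The sketch below assumes such labels exist (the ones shown are hypothetical) and weights coverage by live traffic volume, which is one reasonable reading of "share of live traffic patterns in the golden set" rather than the only one.

```python
from collections import Counter


def eval_coverage(golden_categories: set[str], live_categories: list[str]) -> float:
    """Share of live requests whose category is represented in the golden set."""
    if not live_categories:
        raise ValueError("no live traffic to measure against")
    live_counts = Counter(live_categories)
    covered = sum(n for category, n in live_counts.items() if category in golden_categories)
    return covered / len(live_categories)


if __name__ == "__main__":
    golden = {"billing", "returns"}
    live = ["billing", "billing", "returns", "warranty"]
    print(f"coverage: {eval_coverage(golden, live):.0%}")
```

Categories that appear in live traffic but not in the golden set are the candidates for the next cases to add.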

§ 07

What a leader should take away


1. Reframe the budget. AI systems are not capital projects that finish. The build is a fraction of lifetime cost; continuous evaluation and monitoring is opex you fund from day one.

2. Treat the commercial model as an uncontrolled supplier. You would never run production on a third-party component that silently rewrites itself. A hosted frontier model is exactly that. Govern it like a supplier: version pinning, change notices, pre/post evaluation, rollback.

3. The golden dataset is a strategic asset, not test scaffolding. It is the only thing that lets you state, with evidence, that the system still does what you bought it for. It compounds: every incident makes it sharper. Underinvesting here is the most common and most expensive mistake in enterprise AI.

4. Make "same question, same answer" a board-level reliability metric. If your analytics tool, your agent, and your dashboard disagree, that is a measurable reliability defect, not a quirk to tolerate.

5. Decide your drift posture before you scale, not after. Retrofitting consistency onto a fleet of agents that have drifted in different directions costs far more than instrumenting one well. The cheapest time to instrument is before the second deployment.
