Enterprise AI, decoded

June 7, 2026 · Opinion

AI Is Already Writing Your Code. Who’s Governing It?

Your developers are already using AI to write code. The governance infrastructure your organization built for human-authored software was not designed for what is happening now — and a policy document is not a substitute for technical controls. This piece introduces the AI Development Governance Framework: four levels that give every board, CIO, and CTO a precise answer to where their organization stands today, what it costs to stay there, and what it takes to move.

18 min · Core Agent Architecture · Governance Risk & Trust · AI Infrastructure & Operations · Context Management & Memory · Semantic Layer & Enterprise Semantics

There is a governance gap opening inside most enterprises right now, and it is not showing up in any board report.

It sits between two facts that are both true simultaneously. The first: AI-assisted coding is already happening inside your engineering organization, whether you have sanctioned it or not. The second: the governance infrastructure your organization has built over decades — code review, architecture oversight, change management, security gates, audit trails — was designed for a world where a human being is the author and the accountable party at every step of software development. That world ended quietly, sometime in the last eighteen months, while most leadership teams were still debating adoption policy.

The gap between those two facts is where your next serious technology risk is accumulating.

I am not writing about productivity. Every vendor in this space will tell you about productivity. I am writing about what happens to the systems your business runs on when the process of building and changing those systems is fundamentally altered — and the organizational infrastructure around that process has not kept pace.

The question is no longer whether your developers use AI to write code. Most of them already do. The question is whether your organization can see what is happening, control what it authorizes, and account for it when something goes wrong.

§ 02

What is actually changing — and why it matters to leadership


Software has always been a liability as much as an asset. The difference between organizations that manage that liability well and those that don't has historically come down to process discipline: who reviews code before it ships, who owns architectural decisions, who is accountable when a system fails. Those processes are imperfect, but they rest on a coherent model — a human being made this decision, and we can find out who, why, and what they knew at the time.

AI coding agents break that model at its foundation.

An agent does not carry context about your organization between sessions. It does not know that the utility function it is refactoring is also called by your regulatory reporting pipeline. It does not have professional accountability for the code it produces. It cannot be coached, performance-managed, or held responsible. And it operates at a speed that makes the traditional review cycle, designed for human-paced development, structurally inadequate.

What this means in practice is that the authorship model underpinning fifty years of software engineering is being renegotiated, at every level of the stack, faster than most organizations have recognized. Your developers are becoming editors and reviewers of AI-generated code. Your code review process is being asked to evaluate volume it was never designed to handle, for failure modes it was never calibrated to catch. Your architecture governance is operating on the assumption that developers carry system context in their heads — an assumption that breaks the moment the developer is reviewing rather than writing.

This is not a developer productivity story. It is a systems governance story. And it belongs on the agenda of every CIO, CTO, and board technology committee that is serious about enterprise risk.

§ 03

The failure modes are already in production


I work across engineering organizations in financial services, automotive, retail, and technology. The pattern I see when AI coding is in use without governance is consistent enough that I can describe it before walking into any given organization.

The codebase is quietly fragmenting. Different developers are prompting AI tools with different context about the same system. One agent session learns that a particular service owns customer identity. Another, run by a different developer the following week, makes a different assumption. Both produce code that passes review. Neither inconsistency is visible until someone attempts a refactor or a significant new feature — at which point they find the codebase has developed multiple contradictory assumptions about how core entities work, none of them documented, none attributable to a specific decision.

High-stakes modules are being modified with no awareness of what requires authorization. An agent asked to fix a bug will fix it. It may also optimize, restructure, or extend the surrounding code in ways that are locally sensible but have consequences the developer did not anticipate and the agent had no way to know about. The billing engine, the authentication layer, the fraud detection pipeline: none of these are labeled as high-stakes inside the codebase. The agent has no way to know their business significance. The developer, moving at agent-assisted speed, may not catch the scope of what changed. This is the authorization problem stated plainly: the boundary of what an agent is permitted to change should be drawn around business consequence, not file structure. There is a specific and implementable approach to drawing that boundary, one that does not require the agent to self-police.

Technical debt is accumulating faster than any existing process can detect it. AI-generated code is syntactically clean. It passes linters. It often passes tests written to catch human error patterns. It can be architecturally wrong in ways that take months to surface — by which point the agent session that produced it is long gone, the developer has moved on, and there is no record of what the agent knew or didn't know when it made the structural decisions embedded in that code. Compounding this, the model the agent runs on may have changed in the interim — silently, without a release note — producing subtly different outputs from the same prompts. This is a risk that is significantly underestimated in most enterprise AI governance frameworks, and the conventional regression playbook does not address it.

The cumulative picture is an organization whose software delivery system is producing output faster than ever, with less visibility into what is being decided, by whom, on what basis — and with existing oversight mechanisms that were never designed to catch the failure modes this introduces.

That is a board-level risk. It is also, at the moment, almost entirely invisible in board-level reporting.

§ 04

Why the existing response is insufficient


The most common response I see from enterprise leadership when this conversation reaches them is one of three things: a policy restricting AI tool usage, a mandate to use only approved tools, or an instruction to the engineering team to be careful.

None of these are governance. They are the appearance of governance.

A policy restricting AI tool usage does not change the behavior of fifty engineers who have discovered that a coding agent doubles their output. It changes what they tell their manager when asked. An approved-tools mandate creates a controlled channel while leaving the uncontrolled channels in place, because the incentive to use whatever works fastest does not disappear because a policy document says it should. An instruction to be careful places the entire weight of a systemic risk on individual professional judgment — which is precisely the mechanism that AI-assisted speed undermines.

Real governance at this layer is technical and organizational, not documentary. It requires building a substrate — what I call a harness — that shapes what agents know, constrains what they are authorized to do, and creates a record of what they did. It cannot be delegated to a policy. It has to be built.

§ 05

The AI Development Governance Framework: four levels every organization should know


To make this concrete and actionable, I use a framework that maps where any organization sits in terms of its governance posture for AI-assisted software development. It has four levels. Most organizations, when they are honest, are at Level 0.

Level 0 — Unmanaged exposure

What it looks like: Developers are using AI coding tools — sanctioned or unsanctioned — with no organizational infrastructure around those tools. There may be a policy document. There may be a mandate to use only approved platforms. There may be a verbal instruction to "make sure you review what it produces." None of that constitutes governance.

At Level 0, the agent operates with whatever context the individual developer happens to provide in the chat window. There is no shared knowledge of the system, no authorization boundary, no record of what the agent did or what it was told. Every session is isolated. Every developer is operating on their own interpretation of what the agent should and shouldn't touch.

Why this is the default: Because getting to Level 0 requires nothing. No investment, no decision, no infrastructure. It is simply the state that exists when AI tools arrive in an engineering organization and governance does not keep pace.

The honest test: If your developers are using any AI coding tool — Copilot, Cursor, Claude, ChatGPT, anything — and you do not have a queryable index of your architecture, a rules file in your repository, and session-level logging, you are at Level 0. A policy document does not change this. A ban that is not enforced technically does not change this. Asking developers to write a plan before they implement does not change this. These are intentions, not controls.

The risk: At Level 0, you have no visibility, no control, and no record. When something goes wrong — and at scale, something will — you will not be able to explain what happened, attribute it precisely, or demonstrate to a regulator that reasonable controls were in place.

Level 1 — The Minimum Viable Harness

What it looks like: The organization has put in place the three foundational components that make AI-assisted development governable at the session level. This is the floor. Getting here is a two-to-three-week engineering project, not a multi-quarter program.

Component one: Architecture context, made retrievable. Your architecture documentation — decision records, service READMEs, system design documents, API contracts — is indexed in a vector store and connected to the agent via an MCP server. At the start of every session, the agent queries this index automatically. It knows what system it is operating in before it touches a line of code. An automated pipeline walks the codebase on every merge and generates short structured summaries of each module — what it does, what depends on it, what is sensitive about it — and stores those summaries in the same index. The agent does not infer system structure. It retrieves it.
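To make the retrieval step concrete, here is a minimal sketch that stands in for the vector store with an in-memory word-count index. Everything named here is an assumption for illustration: the `ModuleSummary` fields, the module paths, and the `retrieve` function. A real deployment would use embeddings in Qdrant or Chroma behind an MCP server rather than bag-of-words similarity.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleSummary:
    """One structured summary per module (field names are assumptions)."""
    module: str
    purpose: str
    dependents: tuple[str, ...]
    sensitivity: str


def _vectorize(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index: list[ModuleSummary], query: str, k: int = 2) -> list[ModuleSummary]:
    """Return up to k summaries that share vocabulary with the query."""
    q = _vectorize(query)
    scored = [
        (_cosine(q, _vectorize(f"{s.module} {s.purpose} {s.sensitivity}")), s)
        for s in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]


# Hypothetical summaries, as the merge-time pipeline might produce them.
INDEX = [
    ModuleSummary("billing/invoice.py", "computes invoice totals and tax",
                  ("reporting/regulatory.py",), "revenue critical"),
    ModuleSummary("utils/dates.py", "date parsing helpers",
                  ("billing/invoice.py", "reporting/regulatory.py"),
                  "shared utility used by regulatory reporting"),
    ModuleSummary("web/banner.py", "renders the marketing banner", (), "low"),
]
```

Calling `retrieve(INDEX, "refactor date parsing helpers")` returns the `utils/dates.py` summary, and its `dependents` field is what tells the agent that regulatory reporting sits downstream of an apparently harmless utility.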

Component two: A root-level rules file. A single Markdown file, committed to the repository root, that loads automatically into every agent session. It names the system, lists the high-stakes modules that require human approval before modification, specifies the required workflow for any change touching shared components, and instructs the agent to surface uncertainty rather than resolve it unilaterally. Half a day to write. It changes the behavior of every agent session in your codebase from that point forward.
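One way to picture such a file, and to let a script read the same list the agent reads, is sketched below. The file content, the section heading, and the glob patterns are all hypothetical; no agent tool prescribes this layout. Checking paths from a hook or CI job is my assumption about how the list could be reused, not something the rules file does on its own.

```python
from fnmatch import fnmatch

# Hypothetical rules file content, embedded as a string for the example.
RULES_MD = """\
# System: order-platform

## High-stakes modules (human approval required)
- billing/**
- auth/**
- fraud/scoring.py

## Workflow
- Write a plan before changing shared components.
- Surface uncertainty instead of resolving it unilaterally.
"""


def high_stakes_patterns(rules_text: str) -> list[str]:
    """Collect the bullet items under the 'High-stakes modules' heading."""
    patterns, in_section = [], False
    for line in rules_text.splitlines():
        if line.startswith("## "):
            in_section = line.lower().startswith("## high-stakes modules")
        elif in_section and line.startswith("- "):
            patterns.append(line[2:].strip())
    return patterns


def requires_approval(path: str, patterns: list[str]) -> bool:
    """True if the path matches any high-stakes pattern.

    fnmatch's '*' also matches '/', so 'billing/**' covers nested paths.
    """
    return any(fnmatch(path, p) for p in patterns)
```

With these rules, `requires_approval("billing/engine/tax.py", high_stakes_patterns(RULES_MD))` is true and the same call for `web/banner.py` is false.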

Component three: Session-level logging. A record of what the agent was given, what it planned, who approved it, and what it produced — at the session level, not just the diff level. This is the minimum audit infrastructure. It does not need to be sophisticated. It needs to exist.
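A session record of this kind can be as plain as one JSON line per session. The sketch below is a minimal version under my own assumptions: the `SessionRecord` field names and the JSON Lines file are illustrative choices, not a standard.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass
class SessionRecord:
    """What the agent was given, what it planned, who approved, what it produced."""
    session_id: str
    developer: str
    context_given: list[str]      # documents or summaries loaded into the session
    plan: str
    approved_by: Optional[str]    # None means no human approval was recorded
    files_changed: list[str]
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(log_path: Path, record: SessionRecord) -> None:
    """Append one session as a single JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_records(log_path: Path) -> list[dict]:
    """Load every session record back for investigation or reporting."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Even this much answers the questions a diff cannot: which context the agent had, and whether anyone approved the plan before code was written.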

Open-source implementation: LlamaIndex or LangChain for ingestion, Qdrant or Chroma as the vector store, tree-sitter for code parsing, the MCP SDK for agent connectivity. No enterprise license. No multi-quarter project.

What Level 1 gives you: Agents operating from shared, accurate system knowledge. Authorization boundaries that are technically enforced, not policy-dependent. A basic audit trail. The most common failure modes — context blindness, unauthorized modification of high-stakes modules, invisible AI-generated debt — are structurally addressed.

What Level 1 does not give you: Workflow enforcement with human approval gates. Model version control. Calibrated evaluation. Graph-based authorization. Those come next.

If you do not have these three components in place, you are not at Level 1. You are at Level 0, regardless of what your AI policy document says.

Level 2 — Governed Development

What it looks like: The harness extends beyond session-level context and rules into workflow discipline and operational reliability. This is the appropriate posture for organizations with more than seventy-five engineers, multiple product lines, or any module with regulatory or revenue significance.

Structured workflow enforcement. The research-plan-implement sequence is not a suggestion — it is a technical requirement. The agent produces a written plan file before any implementation begins, specifying the files it will touch, the rationale for each change, and the expected impact. For any change touching a shared component or a module flagged in the rules file, human approval is an explicit gate. The agent does not proceed without it.
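The gate described here can be expressed as a small check that runs before implementation starts. This is a sketch under assumptions: `PlannedChange`, `GateDecision`, and `evaluate_plan` are hypothetical names, and the flagged patterns would in practice come from the rules file.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlannedChange:
    path: str
    rationale: str
    expected_impact: str


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reasons: tuple[str, ...]


def evaluate_plan(
    plan: Sequence[PlannedChange],
    flagged_patterns: Sequence[str],
    approved_by: Optional[str] = None,
) -> GateDecision:
    """Block implementation unless a plan exists and flagged paths are approved."""
    reasons = []
    if not plan:
        reasons.append("no plan: implementation may not start without one")
    for change in plan:
        if not change.rationale.strip():
            reasons.append(f"{change.path}: missing rationale")
        flagged = any(fnmatch(change.path, p) for p in flagged_patterns)
        if flagged and not approved_by:
            reasons.append(f"{change.path}: flagged module needs human approval")
    return GateDecision(allowed=not reasons, reasons=tuple(reasons))
```

A plan touching `billing/invoice.py` with `flagged_patterns=["billing/*"]` is refused until `approved_by` names a person; a plan touching only unflagged files passes without one.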

Model version pinning. The model the agent runs on is treated as an external dependency with a change management process. Specific versions are pinned. Upgrades go through an evaluation run before promotion to production. The model provider is managed as an uncontrolled supplier — because that is what they are.
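As an illustration of treating the model like a pinned dependency, the check below refuses floating aliases and versions that have not been promoted. The naming scheme (a pinned identifier ends in a date stamp) and the model names in the usage note are hypothetical; real providers use their own conventions, so the pattern would need adapting.

```python
import re

# Assumed convention: a pinned identifier ends in YYYY-MM-DD,
# while a floating alias such as "...-latest" does not.
PINNED = re.compile(r".+-\d{4}-\d{2}-\d{2}$")


def check_model_pin(configured: str, approved: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the pin is acceptable."""
    problems = []
    if not PINNED.match(configured):
        problems.append(f"'{configured}' is a floating alias, not a pinned version")
    elif configured not in approved:
        problems.append(f"'{configured}' has not passed an evaluation run")
    return problems
```

With `approved = {"example-model-2025-01-15"}`, that identifier passes, `"example-model-latest"` is rejected as a floating alias, and `"example-model-2025-03-01"` is rejected until it has been evaluated and added to the approved set.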

Calibrated evaluation sets. A library of ten to twenty representative coding tasks, drawn from your actual backlog, is run against the agent on a monthly cadence. Any degradation in output quality relative to the baseline is a signal — either the model changed, or the context has drifted. Either way, it is detected before it reaches production rather than after.
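The comparison against a baseline can be sketched in a few lines. The scoring scale (0.0 to 1.0 per task), the task names, and the 0.05 tolerance are assumptions chosen for the example; how each task is scored is left out entirely.

```python
from statistics import mean


def detect_degradation(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict:
    """Compare per-task scores against the stored baseline."""
    missing = sorted(set(baseline) - set(current))
    if missing:
        raise ValueError(f"evaluation run is missing tasks: {missing}")
    # A task regresses when its score falls by more than the tolerance.
    regressed = sorted(t for t in baseline if baseline[t] - current[t] > tolerance)
    drop = mean(baseline.values()) - mean(current[t] for t in baseline)
    return {
        "mean_drop": round(drop, 4),
        "regressed_tasks": regressed,
        "degraded": bool(regressed) or drop > tolerance,
    }
```

If one task out of three falls from 0.8 to 0.6 while the others hold, that task is listed under `regressed_tasks` and `degraded` is true, which is the signal to investigate whether the model or the context changed.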

What Level 2 gives you: Structural workflow discipline with human oversight at defined decision points. Detection of model drift before it becomes a production problem. The ability to say, with evidence, that your AI-assisted development process has controls — not just policies.

Level 3 — Enterprise-Grade Harness

What it looks like: The harness is a first-class piece of engineering infrastructure, appropriate for organizations with significant regulatory exposure, large concurrent agent deployments, or complex multi-system estates where the business consequence of an unauthorized change is severe.

Graph-based authorization. The code structure graph — which modules call which — is joined to a business consequence graph — which modules underpin which business functions and obligations. Authorization is computed programmatically. The agent's proposed action plan is evaluated against this graph before execution. The boundary of what it is permitted to change is not a list maintained by a human; it is derived from the relationship between code and business impact. This is what authorization by impact radius looks like in practice.

A deterministic enforcement layer. A non-LLM component — a graph-walking evaluator — validates the agent's action plan against authorization policy before implementation begins. It is not probabilistic. It does not hallucinate. It cannot be prompt-injected into approving something it should not. This is the separation of the agent's judgment from the authorization decision that enterprise governance requires.
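A toy version of authorization by impact radius, with a deterministic evaluator, might look like the following. The two graphs, the module paths, and the set of protected business functions are all hypothetical; the only point is that the decision comes from a graph walk, not from a model.

```python
from collections import deque

# Hypothetical graphs. CALLS maps a module to the modules it calls;
# UNDERPINS maps a module to the business functions it supports.
CALLS = {
    "reporting/regulatory.py": ["utils/dates.py", "billing/invoice.py"],
    "billing/invoice.py": ["utils/dates.py"],
    "web/banner.py": ["utils/strings.py"],
}
UNDERPINS = {
    "reporting/regulatory.py": {"regulatory reporting"},
    "billing/invoice.py": {"revenue recognition"},
    "web/banner.py": {"marketing site"},
}
PROTECTED = {"regulatory reporting", "revenue recognition"}


def impact_radius(changed, calls, underpins):
    """Business functions reachable from the changed modules via their callers."""
    # Invert the call graph: a change to a callee affects everything that calls it.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    seen, queue = set(changed), deque(changed)
    while queue:  # breadth-first walk up the caller chain
        module = queue.popleft()
        for caller in callers.get(module, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return set().union(*(underpins.get(m, set()) for m in seen))


def authorize(changed, calls=CALLS, underpins=UNDERPINS, protected=PROTECTED):
    """Deterministic decision: deny when a protected function is in the radius."""
    hit = impact_radius(changed, calls, underpins) & protected
    return {"allowed": not hit, "requires_approval_for": sorted(hit)}
```

Here a plan that touches only `utils/dates.py` is denied, because both regulatory reporting and revenue recognition sit upstream of it, while a plan touching `utils/strings.py` is allowed. Nothing in the file path of `utils/dates.py` would have suggested that outcome, which is the argument for deriving the boundary rather than listing it.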

Enterprise ontology. Canonical definitions of core business concepts — what a customer is, what revenue means, what constitutes a regulated data type — are queryable by agents across the estate. This is the long-term investment that ensures terminological and semantic consistency as the agent population scales. The case for building this, and what it requires, is more substantive than most leadership teams have engaged with.

Per-agent identity and formal audit logging. Every agent has an identity. Every session produces a structured audit trail sufficient for regulatory examination. The retrieval architecture underpinning context delivery is designed for the full estate. Whether to use a vector store, a knowledge graph, or a semantic layer is a decision with long-term architectural implications; it should be made with those implications understood, not revisited under pressure eighteen months later.

What Level 3 gives you: A software delivery system in which AI agents operate with the same governance rigor applied to any other enterprise system. Full auditability. Regulatory defensibility. The ability to scale agent usage without scaling governance risk proportionally.

| Level | Name | Typical organization | What governance covers | What it does not cover |
|---|---|---|---|---|
| 0 | Unmanaged exposure | Any organization using AI coding tools without technical controls | Nothing: policy documents and verbal guidance are not controls | Everything |
| 1 | Minimum viable harness | Teams of 20–75; early AI adoption; limited regulatory exposure | Context infrastructure; rules file; session logging | Workflow enforcement; model version control; graph-based authorization |
| 2 | Governed development | Teams of 75–300; multiple product lines; some regulated modules | Level 1 plus workflow gates; model pinning; evaluation cadence | Deterministic enforcement; ontology; formal audit trail |
| 3 | Enterprise-grade harness | Large estates; significant regulatory exposure; concurrent agent deployments | Full graph-based authorization; deterministic enforcer; ontology; per-agent identity; formal audit logging | |

§ 06

What a harness actually is — the three components that define Level 1


A harness is not a product you buy. It is an architectural decision about how AI agents are integrated into your software delivery system. The three components that constitute Level 1 are described above, but they are worth restating in terms of what each one prevents.

Context infrastructure prevents the agent from operating on inference. Without it, the agent guesses at system structure based on whatever is in the current session. With it, the agent retrieves accurate, current knowledge of your architecture before touching anything.

Authorization boundaries prevent the agent from making decisions about scope that should be made by a human. Without them, the agent will modify whatever seems relevant. With them, the boundary of what the agent can touch without human approval is technically enforced, not policy-dependent.

Observability prevents the organization from being blind to what the agent did. Without it, there is no record. With it, there is an audit trail that supports investigation, regulatory response, and detection of model degradation over time.

For a fifty-person engineering team, all three are buildable in two to three weeks using open-source components — LlamaIndex or LangChain for document ingestion, Qdrant or Chroma as the vector store, tree-sitter for code parsing, the MCP SDK for agent connectivity. No enterprise software contract required.

§ 07

The leadership question


Most technology risks of this magnitude arrive with obvious signals — a vendor failure, a regulatory action, a public incident. This one is different. The accumulation is internal and gradual. The code looks fine. The developers are more productive. The dashboards show green. The risk is in the structure of what is being built, in the absence of records that would be needed if something went wrong, and in the gap between the speed of AI-assisted development and the speed of the oversight mechanisms around it.

The organizations that are ahead of this are not the ones that moved fastest on AI adoption. They are the ones that recognized, early, that adopting AI coding was not a tooling decision — it was a decision to change how software is authored, reviewed, and governed, and that the organizational infrastructure around software development needed to change with it.

The framework above gives every leadership team a precise answer to the question of where they stand. Level 0 is not a failure of intent. It is simply the state that exists before the work of governance begins. The failure is staying there.

The engineering leaders and boards that are having this conversation now, before the first serious incident, are the ones who will be in a position to answer confidently when they are eventually asked: what did you know, when did you know it, and what did you put in place?
