Prompts in Code Is a Trap. Here Are Two Ways Out.

§ 01

I Blew My Budget Because I Hardcoded a Model

One agent. One provider. Anthropic. Prompts in the codebase. Everything worked.

Then I needed to process images at scale. Anthropic's vision pricing accumulated faster than I noticed. When the bill arrived, I understood something I had not fully grasped before: models are not just a technical choice. They are a cost lever. And when that lever is hardcoded, you cannot pull it without a deployment.

I added OpenAI for image tasks. Then DeepSeek for better Asian language context. Three providers, three API keys, three billing relationships, and routing logic scattered across my codebase. Every model change required touching code. Every prompt edit meant a commit, a review, a deploy.

That is when I built two things: a prompt management system, and two different approaches to model selection. Both as POCs, on a live agent, under real budget pressure. This is what I learned — including whether I would build it again or just use what already exists.

This article moves through both problems in order. First prompt management — getting prompts out of code and under control. Then model selection — how to decide which model runs each prompt, and the three architectures for doing it.

§ 02

Part One: Prompt Management

§ 03

Prompts in Code Is a Trap

It feels manageable at first. A string in a config file. A constant at the top of a function. You can read it, change it, ship it.

Here is what you are actually signing up for.

Every edit is a deployment. A wording change to a system prompt requires a commit, a review, a merge, a deploy. For something you iterate on constantly — especially early in development — this cycle kills momentum.

You cannot see them all at once. Prompts scattered across files have no single view. You do not know how many there are, what each does, which model each runs against, or when any of them last changed.

There is no rollback. A prompt change breaks something in production. Rolling back means reverting code and redeploying — if you caught it quickly, if you knew which change caused it.

Fine-tuning a prompt is also about finding the right model. This is the insight that usually gets left out. When you are iterating on a prompt, you are not just editing words — you are asking: does this task need this model? Could something cheaper handle it just as well? You cannot answer that efficiently when the model is hardcoded and changing it requires a deployment. This is why prompt management and model selection are two halves of one problem — and why this article covers both.

Non-engineers cannot contribute. Product instincts are often sharper than engineering instincts on how instructions should be phrased. Prompts in code are invisible to everyone outside the repository.

The fix is simple in principle: take prompts out of code. They live in a registry, editable through a UI, fetched at runtime. A prompt change becomes a config change. The agent picks it up without a release.

§ 04

What Prompt Management Actually Requires

A registry. Every prompt used by every agent, in one place. Name, version, agent, model or tier, last edited, last deployed, call volume. One row per prompt. The full picture in a single view.

An editor with version control. Open any prompt, edit it, save it as a new version. Every save is versioned. Diff any two versions. Promote to production in one action. Roll back to any prior version in under a minute without touching code.

A playground. The piece most implementations skip and then regret. Run a prompt against a live model — or multiple models side by side — before promoting it. Type a test input, fire the call, see output, token count, and cost. Compare gpt-4o-mini against claude-3-5-sonnet on the same input. The playground is where you discover that a task you assumed needed your most expensive model works just as well at a tenth of the cost. Without it, you are deploying blind.

Runtime fetch with fallback. Your agent stores a prompt name, not a prompt. It calls the registry at startup, fetches the current production version, and caches it. When you update in the UI, the agent picks it up on the next fetch cycle without a release. Critical caveat: always keep a local fallback copy of every prompt bundled with your deployment. If your registry goes down or is unreachable, your agent needs to function. A registry outage should not be a production incident.

Prompt evals before promotion. A prompt change without testing is as risky as a code change without testing. Before any prompt goes to production, run it against a standard set of test inputs and validate outputs. An eval suite does not need to be elaborate — even a small set of representative inputs with expected output characteristics catches the majority of regressions. Changing a prompt in a UI and deploying it without evals is how silent quality degradations reach production.

§ 05

The Tools That Already Do This

Langfuse is the most complete open-source option. Versioning, SDK fetch, playground, production promotion, rollback — all built in, and connected to your call traces so you can see which prompt version was running on any historical call. Self-hosted and free.

PromptLayer is purpose-built for prompt registries. Git-style versioning, visual editor, strong collaboration features, well-implemented playground.

LangSmith if you are already in the LangChain ecosystem. Prompt versioning is a natural extension of its tracing infrastructure.

For the prompt management half of the problem, these tools are mature and there is little reason to build your own. I will return to the build-vs-buy question once the model selection picture is complete — because that is where the one genuine gap lives.

§ 06

Part Two: Model Selection

The registry stores which model each prompt runs against. But how do you decide what goes in that field? There are two approaches, and a third scenario that reframes the choice entirely.

§ 07

Approach 1: Pick the Provider and Model Directly

Each prompt has an explicit provider and model setting. Anthropic / claude-3-5-sonnet. OpenAI / gpt-4o-mini. You change it in the UI, and the next call uses the new model.

	Approach 1: Direct Selection
Best for	Small teams, precise per-prompt optimisation, aggregator users
Pros	Full visibility into exactly what runs. Playground comparisons are exact. Cost differences between models are explicit in your call log.
Cons	You need an API key for every provider. Staying current on every provider's model lineup is on you. Manual updates when models deprecate.

When it makes sense: you have one or two providers and want precision control. Or you are using an aggregator as your access layer — more on that shortly.

The LiteLLM caveat most people miss: LiteLLM gives you a unified SDK across 100+ models, which makes it look like the key-management problem is solved. It is not. LiteLLM abstracts the SDK. It does not abstract the commercial relationships. You still need an Anthropic key to call Anthropic models, an OpenAI key for OpenAI, a DeepSeek key for DeepSeek. The API surface is unified. The billing is not.

§ 08

Approach 2: Pick the Capability Tier, Let the System Route

This is what I switched to when managing three providers started to feel like a job. Each prompt specifies a tier — Fast, Balanced, or Best. A mapping table resolves the tier to the best available model from whichever providers you have configured.

Tier	Resolves to
Fast	deepseek-chat, gpt-4o-mini, or claude-haiku — whichever provider is configured
Balanced	gpt-4o or claude-3-5-sonnet
Best	Top-tier model from your configured providers

Update the mapping when a better model ships. Every prompt using that tier picks it up automatically. When DeepSeek released a stronger model for Asian language tasks, I changed one line. No prompt-level changes.

	Approach 2: Tier-Based Routing
Best for	Multiple direct provider relationships, BYO-LLM products, insulating from provider churn
Pros	One decision per prompt — what capability does this need. Single-point model updates. Provider-agnostic by design.
Cons	Loss of per-prompt precision. The mapping table needs an owner who updates it as providers ship new models. Playground testing is less exact since routing resolves at runtime.

When this is essential, not just convenient:

You have direct relationships with multiple providers and no aggregator. Three keys, three billing accounts, your own routing layer. Approach 2 is the abstraction that makes that manageable.

You are building a product where customers bring their own LLM. This is not optional — it is the architecture. A customer connecting their own Anthropic key, their own Azure OpenAI deployment, or a self-hosted model cannot get hardcoded model names. The tier system is what makes the product work regardless of which providers the customer brings.

§ 09

The Third Scenario: LLM Aggregators

Both approaches above assume you hold the provider relationships. Aggregators change that — and they reframe when approach 1 is actually the clean choice.

An aggregator like OpenRouter, or a managed LiteLLM instance, works differently from direct relationships. You subscribe to the aggregator. One API key. One budget, drawn down across any model from any provider you call through their endpoint. One billing relationship, one invoice, one key in your codebase.

This is materially different from running the LiteLLM open-source SDK yourself — where you still hold every provider key and every billing account, and LiteLLM only unifies the API surface. An aggregator takes the commercial layer off your plate entirely.

On an aggregator, approach 1 becomes genuinely clean. You specify exact model names because the aggregator handles the provider relationship behind the scenes. No separate keys, no separate billing accounts. You switch models by changing a string. OpenRouter currently offers 200+ models from 60+ providers this way — the fastest path from zero to multi-model access with no infrastructure to manage.

The tradeoffs you are accepting:

Single point of failure — if the aggregator has an outage, you lose every model at once, not just one provider. A pricing markup that is negligible at low volume and a real cost line at enterprise scale. Your data flows through the aggregator's infrastructure before reaching the provider, which can fail a security review or violate data residency requirements. And you cannot use any volume pricing you have negotiated directly with a provider.

The full picture is three scenarios:

Scenario	Architecture	Key management
Using an aggregator (OpenRouter, managed LiteLLM)	Approach 1 — exact model names	One key, one budget, one relationship
Direct provider relationships, no aggregator	Approach 2 — tier-based routing	Multiple keys and billing accounts, your own routing
BYO-LLM product	Approach 2 — mandatory	Customer-configured providers, tier mapping absorbs the variation

For small teams without data residency constraints: an aggregator is the simplest path. Start there. Move to direct relationships if you hit a compliance wall or the markup becomes material. For enterprises: security and compliance usually decide the aggregator question before you reach architecture. If aggregators are allowed, they simplify everything. If not, approach 2 is your architecture.

§ 010

So Would I Build Any of This? Mostly No.

Now that all the pieces are defined, the build-vs-buy answer is precise.

Prompt management — registry, editor, versioning, playground: buy. Use Langfuse or PromptLayer. They are mature, well-maintained, and better than anything you will build in a weekend. Langfuse is self-hosted and free. There is no compelling reason to build these from scratch. I built mine with a single prompt and would not do it again — the moment you own it, you own the maintenance, the edge cases, and the onboarding burden for everyone after you.

Model routing — the tier layer: build, but only this. Nothing in the market handles tier-based selection cleanly for the scenarios where it matters — selective provider relationships and BYO-LLM products. But this piece is small. It is a mapping table and a resolver function that sits on top of your prompt management tool. Build that one component. Buy everything around it.

That is the honest answer: buy the platform, build the thin routing layer the platform does not provide.

§ 011

Putting It Into Practice

§ 012

The Dashboard: Four Views

Prompt library. Every prompt in the system. Name, agent, version, model or tier, last edited, last deployed, call volume, cost per call. The operational view — open it when something behaves unexpectedly.

Editor and playground. Prompt text, version history, diff view, variable list. Below that: a test input field, a model selector, a Run button. Output, token count, and cost inline. Run the same input across multiple models to compare before promoting.

Model configuration. Approach 2 only. Configured providers, active keys, current tier mapping, last updated per tier. The single place where routing logic lives.

Usage and performance. Cost per prompt, call volume, finish reason distribution, average tokens. Connected to the call log — every entry links to a prompt name and version so you can filter your observability dashboard by prompt and see how it performs over time.

§ 013

For the Enterprise

Question	What the answer tells you
"How are prompts managed?"	If they are in the codebase, you have an unaudited control surface for production AI behaviour
"Can we switch providers without a code release?"	If no, you are one outage or pricing change away from an emergency deployment
"Are we locked to specific model names?"	Every hardcoded model string is technical debt with a ticking expiration date
"What is our prompt testing process?"	If there is no eval suite, prompt changes are reaching production untested

What architects should require: prompts do not live in code. Every production prompt is versioned, attributable to a person, and rollbackable in under five minutes without a deployment. Model or tier selection is recorded in every call log entry.

For product teams building with customer-configurable LLMs: approach 2 is the architecture. Define your tiers, build the mapping table, and make provider configuration a first-class feature from day one — not something bolted on when your first enterprise customer asks why they cannot use their own Azure deployment.

A prompt is the instruction set that determines what your AI system does. It needs version control, it needs testing, and it needs to be changeable without an engineering event.