§ 01
The Problem I Was Actually Solving
I was mid-development on an agent. Something behaved unexpectedly. To understand what happened, I had to query the logs. Raw logs. Every time.
That feedback loop is brutal when you are iterating fast. And the underlying reason it matters is not a tooling problem: it is a property of the systems we are building. LLM calls are non-deterministic. The same input is not guaranteed to produce the same output. That is not a bug to route around. It is the nature of the thing. Which means you cannot reason about your agent's behavior from memory or from the last run you remember. You need to see what actually happened on each call.
I am a control freak when it comes to systems I am responsible for. I think that instinct is correct when those systems produce a different result every time they run. So I built a dashboard. With a single prompt. It took twenty minutes. And it taught me more about what I actually needed than I expected.
§ 02
What the Market Already Offers
Before going further: the observability tooling ecosystem is mature. You do not need to build anything.
Langfuse is the open-source benchmark. It traces every call, links multi-step agent workflows into a single hierarchical view, tracks cost and tokens per step, manages prompt versions, and is fully self-hostable for free. It is the tool most serious practitioners reach for first.
Helicone is the fastest path to visibility. Change one line of code — your API base URL — and every request is logged with cost, tokens, latency, and the full request and response. Two minutes from zero to dashboard.
Arize Phoenix is the right choice for teams already running enterprise observability infrastructure on OpenTelemetry. It connects AI call data to the same toolchain your platform team already operates.
LangSmith is purpose-built for LangChain and LangGraph stacks. If that is your architecture, nothing else matches its integration depth.
MLflow covers both classical ML and LLMs under one roof if your team already lives there and wants one platform rather than two.
All of them surface the five fields that matter. I will come back to what those five fields are, and why that is the lens that should drive your evaluation.
§ 03
Build vs. Buy: What Building It Taught Me
I built mine with a single prompt. It was trivially easy. Which is exactly why the build vs. buy decision is not about difficulty — it is about what you are signing up for the moment it works.
Maintenance you did not plan for. Provider pricing changes. New models ship. Token counting logic shifts for reasoning models. Your dashboard does not update itself. The commercial tools do.
You are reinventing hardened infrastructure. Langfuse and Helicone have been built against edge cases you have not hit yet — streaming responses, tool calls, multi-modal inputs, retry logic, timeout handling. A prompt-built dashboard handles the happy path. Production is not the happy path.
It does not scale beyond you. A teammate cannot onboard to your custom build the way they can to Langfuse. No documentation, no community, no support channel. No one else knows how it works.
You are one feature request away from scope creep. The moment you want prompt versioning, eval scores, or alert thresholds, you are building a product instead of shipping one.
The existing tools give you what you need anyway. If the only reason to build is visibility into the five fields, Helicone gives you that with a one-line change to your API base URL. Langfuse gives you that and more, self-hosted, for free.
Would I build it again? Probably not. What I built was a proof of what matters, not a prescription for how to get there. If an existing tool surfaces these five fields cleanly for your stack, use it.
§ 04
The Five Fields That Actually Matter
This is the real output of the exercise. Whatever tool you use — built or bought — these are the five questions it needs to answer for every LLM call:
1. Cost — what did this call cost, calculated from actual token counts against provider pricing. Not estimated. Not aggregated. Per call.
2. Token split — how many tokens went in as input, how many came back as output. The split matters because input and output are priced differently, with output tokens typically the more expensive side, often by a factor of three to five.
3. Finish reason — why did the model stop. This is the field most developers never look at deliberately, and it is the most important one. stop means natural completion. length means the response was cut off at the token limit — the output is incomplete, and the call still returned HTTP 200. That is a silent failure. content_filter means a safety layer intervened. You will not know any of this is happening unless you surface it.
4. Prompt — exactly what was sent. Not the template. The rendered prompt, with all variables filled, as the model received it.
5. Output — exactly what came back. The full response, not a truncated preview.
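As a rough illustration of fields 1 to 3, the sketch below computes a per-call cost from token counts and flags calls that ended abnormally. The model name `example-model`, the prices in `PRICING_PER_MTOK`, and the `CallRecord` type are all hypothetical placeholders; real prices come from your provider's current price list.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per one million tokens. These are assumptions
# for illustration only; look up your provider's current pricing.
PRICING_PER_MTOK = {
    "example-model": {"input": 2.50, "output": 10.00},
}

# Finish reasons that mean the call did not end in a natural completion.
ABNORMAL_FINISH_REASONS = {"length", "content_filter"}


@dataclass
class CallRecord:
    model: str
    prompt: str          # rendered prompt, exactly as sent
    output: str          # full response text
    finish_reason: str   # e.g. "stop", "length", "content_filter"
    input_tokens: int
    output_tokens: int


def cost_usd(call: CallRecord) -> float:
    """Per-call cost from actual token counts, priced separately per direction."""
    price = PRICING_PER_MTOK[call.model]
    return (
        call.input_tokens * price["input"]
        + call.output_tokens * price["output"]
    ) / 1_000_000


def ended_abnormally(call: CallRecord) -> bool:
    """True when the call was cut off or filtered, even if the request succeeded."""
    return call.finish_reason in ABNORMAL_FINISH_REASONS


call = CallRecord(
    model="example-model",
    prompt="Summarise the attached report.",
    output="The report covers",
    finish_reason="length",
    input_tokens=1_200,
    output_tokens=400,
)
print(f"cost: ${cost_usd(call):.4f}, ended abnormally: {ended_abnormally(call)}")
```

Keeping the price table separate from the calculation means a pricing change is a data update rather than a code change.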
Aggregated across sessions, these five fields answer the budget question — which agents, which prompts, which workflows are driving spend — and the quality question — where calls are ending abnormally and why.
§ 05
The Database Layout
If you are building your own logging layer, or evaluating what a tool should capture, this is the minimum schema for an LLM call log:
| Field | Type | Notes |
|---|---|---|
| call_id | UUID | Primary key |
| timestamp | Datetime | When the call was made |
| agent_name | String | Which agent or workflow triggered it |
| prompt_name | String | Named prompt template used |
| prompt_version | String | Version of the prompt |
| model | String | e.g. gpt-4o, claude-3-5-sonnet |
| system_prompt | Text | Full system prompt sent |
| user_input | Text | Full user message sent |
| output | Text | Full model response |
| finish_reason | Enum | stop, length, content_filter, tool_calls |
| input_tokens | Integer | Prompt token count |
| output_tokens | Integer | Completion token count |
| total_tokens | Integer | Sum of input_tokens and output_tokens |
| cost_usd | Float | Calculated from token counts and model pricing |
| latency_ms | Integer | Time to response |
| session_id | String | Groups related calls in a conversation or workflow |
| environment | Enum | dev, staging, production |
| error | Text | Null if successful |
Every agent writes one row per call. The log is append-only. Nothing is deleted — you need the history to see patterns.
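One way to realise this schema is a single SQLite table with triggers that reject updates and deletes. This is a minimal sketch, not a production design: the table name `llm_calls`, the SQLite column types, the CHECK constraints, and the `log_call` helper are assumptions made for illustration.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Column names follow the table above; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    call_id        TEXT PRIMARY KEY,
    timestamp      TEXT NOT NULL,
    agent_name     TEXT NOT NULL,
    prompt_name    TEXT,
    prompt_version TEXT,
    model          TEXT NOT NULL,
    system_prompt  TEXT,
    user_input     TEXT,
    output         TEXT,
    finish_reason  TEXT CHECK (finish_reason IN
                       ('stop', 'length', 'content_filter', 'tool_calls')),
    input_tokens   INTEGER,
    output_tokens  INTEGER,
    total_tokens   INTEGER,
    cost_usd       REAL,
    latency_ms     INTEGER,
    session_id     TEXT,
    environment    TEXT CHECK (environment IN ('dev', 'staging', 'production')),
    error          TEXT
);

-- Enforce the append-only rule at the database level.
CREATE TRIGGER IF NOT EXISTS llm_calls_no_update
BEFORE UPDATE ON llm_calls
BEGIN
    SELECT RAISE(ABORT, 'llm_calls is append-only');
END;

CREATE TRIGGER IF NOT EXISTS llm_calls_no_delete
BEFORE DELETE ON llm_calls
BEGIN
    SELECT RAISE(ABORT, 'llm_calls is append-only');
END;
"""


def log_call(conn: sqlite3.Connection, **fields) -> str:
    """Append one row for one LLM call and return its call_id."""
    row = {
        "call_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    if "input_tokens" in row and "output_tokens" in row:
        row["total_tokens"] = row["input_tokens"] + row["output_tokens"]
    # Keys become column names, so they must come from trusted code,
    # never from user input.
    columns = ", ".join(row)
    placeholders = ", ".join(f":{name}" for name in row)
    conn.execute(f"INSERT INTO llm_calls ({columns}) VALUES ({placeholders})", row)
    conn.commit()
    return row["call_id"]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
call_id = log_call(
    conn,
    agent_name="researcher",
    model="example-model",
    user_input="Summarise the attached report.",
    output="The report covers",
    finish_reason="length",
    input_tokens=1200,
    output_tokens=400,
    environment="dev",
)
print(conn.execute("SELECT call_id, total_tokens FROM llm_calls").fetchall())
```

The triggers are optional; they simply turn the append-only convention into something the database refuses to violate.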
§ 06
The Dashboard
Two layers. Neither requires a query language.
Layer one: the call feed. A chronological list of every LLM call with the five core fields visible without clicking through. Filterable by agent, prompt, model, finish reason, and environment. The purpose is forensic — when something behaves unexpectedly, you find the call and read exactly what happened.
Layer two: the aggregated view. Cost by agent, cost by prompt, token distribution over time, finish reason breakdown as a percentage. This is where patterns become visible: a prompt that consistently hits length, a model that costs three times more than expected for a specific task, an agent workflow where 15% of calls are ending on content_filter.
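The aggregated view boils down to a handful of group-by computations over the call log. The sketch below shows two of them over in-memory rows; the sample data, agent names, and function names are invented for illustration, and a real dashboard would more likely run the equivalent queries against its log store.

```python
from collections import Counter, defaultdict

# Invented sample rows; only the fields needed for aggregation are shown.
calls = [
    {"agent_name": "researcher", "finish_reason": "stop", "cost_usd": 0.02},
    {"agent_name": "researcher", "finish_reason": "length", "cost_usd": 0.04},
    {"agent_name": "summariser", "finish_reason": "stop", "cost_usd": 0.01},
    {"agent_name": "summariser", "finish_reason": "content_filter", "cost_usd": 0.01},
]


def cost_by(rows, key):
    """Total cost grouped by any field, e.g. agent_name or prompt_name."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return dict(totals)


def finish_reason_breakdown(rows):
    """Share of calls per finish reason, as a percentage of all calls."""
    counts = Counter(row["finish_reason"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: 100 * n / total for reason, n in counts.items()}


print(cost_by(calls, "agent_name"))
print(finish_reason_breakdown(calls))
```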
One honest caveat: this works at developer scale — dozens to low hundreds of calls per day. Once you are running hundreds of concurrent users, individual trace inspection becomes as overwhelming as raw logs. At that point you need automated evaluation layered on top — LLM-as-judge scoring, statistical assertions on finish reason distributions, cost anomaly detection. The dashboard becomes the drill-down tool when the automated layer flags something. It is not the whole answer at scale. It is the foundation everything else is built on.
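Cost anomaly detection can start very simply. The sketch below compares a new cost figure against a baseline window using a z-score; the function name, the threshold of 3.0, and the sample numbers are assumptions, and a production system would likely want something more robust to seasonality and skew.

```python
from statistics import mean, stdev


def is_cost_anomaly(baseline_costs, new_cost, z_threshold=3.0):
    """Flag new_cost when it sits far above the baseline window's mean."""
    if len(baseline_costs) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_costs)
    sigma = stdev(baseline_costs)
    if sigma == 0:
        return new_cost > mu  # flat baseline: any increase stands out
    return (new_cost - mu) / sigma > z_threshold


# Invented daily cost totals in USD for one agent.
baseline = [1.00, 1.10, 0.90, 1.00, 1.05]
print(is_cost_anomaly(baseline, 2.00))
print(is_cost_anomaly(baseline, 1.05))
```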
§ 07
Alerts: What a Production Version Needs
A dashboard without alerts is passive. You only see problems when you happen to open it. A production observability system needs to tell you when something requires attention.
The four alert types that matter:
length or content_filter frequency exceeds a threshold. A spike in length means a prompt is consistently hitting the token ceiling. A spike in content_filter means something in your inputs or outputs is triggering guardrails systematically.
Alerts are the difference between a monitoring tool and an operations tool.
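The finish reason frequency alert described above can be sketched as a rate check over a recent window of calls. The threshold values, the `min_calls` guard, and the function name are assumptions chosen for illustration; sensible limits depend on your own baseline.

```python
from collections import Counter


def finish_reason_alerts(finish_reasons, thresholds, min_calls=20):
    """Return one message per finish reason whose share exceeds its threshold."""
    total = len(finish_reasons)
    if total < min_calls:
        return []  # too few calls for a meaningful rate
    counts = Counter(finish_reasons)
    alerts = []
    for reason, limit in thresholds.items():
        rate = counts[reason] / total
        if rate > limit:
            alerts.append(f"{reason} rate {rate:.0%} exceeds threshold {limit:.0%}")
    return alerts


# Invented window: the finish reasons of the 20 most recent calls.
recent = ["stop"] * 16 + ["length"] * 4
thresholds = {"length": 0.10, "content_filter": 0.05}
for message in finish_reason_alerts(recent, thresholds):
    print(message)
```

The `min_calls` guard avoids firing on tiny samples, where a single truncated call would otherwise look like a spike.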
§ 08
For the Enterprise: What You Should Be Demanding
Individual developers build this visibility because they need it to do their job. Enterprises need it for higher-stakes reasons: multiple agents, multiple teams, real budgets, and real accountability for what the systems do and what they cost.
What enterprise architects should require from every AI deployment:
Every agent writes to a centralised call log. No exceptions. The log captures the five fields as a minimum. The log is surfaced in a shared dashboard accessible to the team building and operating the product — not gated behind engineering access. Alerts are configured before go-live, not after the first cost surprise.
What CIOs and CDOs should be asking:
Do we have call-level visibility into our AI systems? Not system-level uptime. Call-level. Can we answer, for any interaction, what it cost, what the model received, what it returned, and why it stopped? If the answer is no, you are running AI on trust rather than visibility.
The enterprise-grade version of this adds:
Role-based access to the dashboard. Data retention policies aligned with compliance requirements. Audit trails on prompt versions — which version was running when a specific call was made. Integration with existing observability infrastructure so AI call data sits alongside the rest of your operational telemetry, not in a separate system no one checks.
The five fields do not change. The governance layer around them does.
AI without observability is not a production system. It is a black box running on your budget.
§ 09
Implementation: Where to Start
If you are an individual developer or small team:
If you are an enterprise team: