I Built an LLM Observability Dashboard With a Single Prompt. Here Is What It Taught Me About What Actually Matters.

§ 01

The Problem I Was Actually Solving

I was mid-development on an agent. Something behaved unexpectedly. To understand what happened, I had to query the logs. Raw logs. Every time.

That feedback loop is brutal when you are iterating fast. And the underlying reason it matters is not a tooling problem: it is a property of the systems we are building. LLM calls are non-deterministic. The same input is not guaranteed to produce the same output from one call to the next. That is not a bug to route around. It is the nature of the thing. Which means you cannot reason about your agent's behavior from memory or from the last run you remember. You need to see what actually happened on each call.

I am a control freak when it comes to systems I am responsible for. I think that instinct is correct when those systems produce a different result every time they run. So I built a dashboard. With a single prompt. It took twenty minutes. And it taught me more about what I actually needed than I expected.

§ 02

What the Market Already Offers

Before going further: the observability tooling ecosystem is mature. You do not need to build anything.

Langfuse is the open-source benchmark. It traces every call, links multi-step agent workflows into a single hierarchical view, tracks cost and tokens per step, manages prompt versions, and is fully self-hostable for free. It is the tool most serious practitioners reach for first.

Helicone is the fastest path to visibility. Point your API base URL at Helicone, add your Helicone key, and every request is logged with cost, tokens, latency, and the full request and response. Two minutes from zero to dashboard.

Arize Phoenix is the right choice for teams already running enterprise observability infrastructure on OpenTelemetry. It connects AI call data to the same toolchain your platform team already operates.

LangSmith is purpose-built for LangChain and LangGraph stacks. If that is your architecture, nothing else matches its integration depth.

MLflow covers both classical ML and LLMs under one roof if your team already lives there and wants one platform rather than two.

All of them surface the five fields that matter. I will come back to what those five fields are, and why that is the lens that should drive your evaluation.

§ 03

Build vs. Buy: What Building It Taught Me

I built mine with a single prompt. It was trivially easy. Which is exactly why the build vs. buy decision is not about difficulty — it is about what you are signing up for the moment it works.

Maintenance you did not plan for. Provider pricing changes. New models ship. Token counting logic shifts for reasoning models. Your dashboard does not update itself. The commercial tools do.

You are reinventing hardened infrastructure. Langfuse and Helicone have been built against edge cases you have not hit yet — streaming responses, tool calls, multi-modal inputs, retry logic, timeout handling. A prompt-built dashboard handles the happy path. Production is not the happy path.

It does not scale beyond you. A teammate cannot onboard to your custom build the way they can to Langfuse. No documentation, no community, no support channel. No one else knows how it works.

You are one feature request away from scope creep. The moment you want prompt versioning, eval scores, or alert thresholds, you are building a product instead of shipping one.

The existing tools give you what you need anyway. If the only reason to build is visibility into the five fields, Helicone gives you that with a base URL change and an API key. Langfuse gives you that and more, self-hosted, for free.

Would I build it again? Probably not. What I built was a proof of what matters, not a prescription for how to get there. If an existing tool surfaces these five fields cleanly for your stack, use it.

§ 04

The Five Fields That Actually Matter

This is the real output of the exercise. Whatever tool you use — built or bought — these are the five questions it needs to answer for every LLM call:

1. Cost — what did this call cost, calculated from actual token counts against provider pricing. Not estimated. Not aggregated. Per call.

2. Token split: how many tokens went in as input, how many came back as output. The split matters because input and output are priced differently. Output tokens usually cost more, often by a factor of three to five.

3. Finish reason: why did the model stop. This is the field most developers never look at deliberately, and it is the most important one. A value of stop means natural completion. A value of length means the response was cut off at the token limit: the output is incomplete, and the call still returned HTTP 200. That is a silent failure. A value of content_filter means a safety layer intervened. These are the OpenAI-style names; other providers label the same conditions differently (Anthropic's API, for example, reports max_tokens where OpenAI reports length). You will not know any of this is happening unless you surface it.

4. Prompt — exactly what was sent. Not the template. The rendered prompt, with all variables filled, as the model received it.

5. Output — exactly what came back. The full response, not a truncated preview.
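Fields 1 and 2 combine into a small calculation. The following is a minimal sketch, assuming a hypothetical pricing table: the model name and the prices are placeholders, not real provider rates, and `call_cost_usd` is an illustrative helper name.

```python
from decimal import Decimal

# Hypothetical USD prices per one million tokens. Real prices change,
# so load them from your provider's current price list.
PRICING_PER_MILLION = {
    "example-model": {"input": Decimal("2.50"), "output": Decimal("10.00")},
}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost of one call from its actual token counts."""
    # An unknown model raises KeyError on purpose: a missing price
    # should be loud, not silently logged as zero cost.
    prices = PRICING_PER_MILLION[model]
    total = prices["input"] * input_tokens + prices["output"] * output_tokens
    return total / Decimal(1_000_000)


cost = call_cost_usd("example-model", input_tokens=1_200, output_tokens=300)
print(f"cost per call: ${cost}")
```

Decimal is used here so that many small per-call amounts sum without floating-point drift. Whether that matters depends on how precisely you need to reconcile against an invoice.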

Aggregated across sessions, these five fields answer the budget question — which agents, which prompts, which workflows are driving spend — and the quality question — where calls are ending abnormally and why.
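For the quality question, the finish reason can be checked mechanically on every call. This sketch assumes the OpenAI-style values named above; the helper name and the wording of the messages are illustrative, and the mapping needs adapting for providers that use other labels.

```python
from typing import Optional

# Finish reasons that indicate a call did not end normally,
# using the OpenAI-style names from the text.
ABNORMAL_FINISH_REASONS = {
    "length": "truncated at the token limit; output is incomplete",
    "content_filter": "a safety layer intervened",
}

# Finish reasons treated as a normal end of the call.
NORMAL_FINISH_REASONS = {"stop", "tool_calls"}


def finish_reason_problem(finish_reason: Optional[str]) -> Optional[str]:
    """Return a description if the call ended abnormally, else None."""
    if finish_reason in NORMAL_FINISH_REASONS:
        return None
    if finish_reason in ABNORMAL_FINISH_REASONS:
        return ABNORMAL_FINISH_REASONS[finish_reason]
    # Unknown or missing values are surfaced rather than assumed fine.
    return f"unrecognised finish reason: {finish_reason!r}"


for reason in ("stop", "length", "content_filter", None):
    print(reason, "->", finish_reason_problem(reason))
```

Treating unknown values as a problem is a deliberate choice in this sketch: a provider adding a new finish reason should show up in the dashboard, not disappear into the normal bucket.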

§ 05

The Database Layout

If you are building your own logging layer, or evaluating what a tool should capture, this is the minimum schema for an LLM call log:

Field	Type	Notes
call_id	UUID	Primary key
timestamp	Datetime	When the call was made
agent_name	String	Which agent or workflow triggered it
prompt_name	String	Named prompt template used
prompt_version	String	Version of the prompt
model	String	e.g. gpt-4o, claude-3-5-sonnet
system_prompt	Text	Full system prompt sent
user_input	Text	Full user message sent
output	Text	Full model response
finish_reason	Enum	stop, length, content_filter, tool_calls
input_tokens	Integer	Prompt token count
output_tokens	Integer	Completion token count
total_tokens	Integer	input_tokens + output_tokens
cost_usd	Float	Calculated from token counts and model pricing
latency_ms	Integer	Time to response
session_id	String	Groups related calls in a conversation or workflow
environment	Enum	dev, staging, production
error	Text	Null if successful

Every agent writes one row per call. The log is append-only. Nothing is deleted — you need the history to see patterns.
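One way to realise this schema is a single SQLite table. This is a sketch under assumptions, not the layout any particular tool uses: the table name `llm_calls`, the `log_call` helper, and the SQLite type choices (TEXT for UUID and datetime, REAL for cost) are all illustrative.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    call_id        TEXT PRIMARY KEY,
    timestamp      TEXT NOT NULL,
    agent_name     TEXT NOT NULL,
    prompt_name    TEXT,
    prompt_version TEXT,
    model          TEXT NOT NULL,
    system_prompt  TEXT,
    user_input     TEXT,
    output         TEXT,
    finish_reason  TEXT,
    input_tokens   INTEGER,
    output_tokens  INTEGER,
    total_tokens   INTEGER,
    cost_usd       REAL,
    latency_ms     INTEGER,
    session_id     TEXT,
    environment    TEXT CHECK (environment IN ('dev', 'staging', 'production')),
    error          TEXT
)
"""


def log_call(conn: sqlite3.Connection, **fields) -> str:
    """Append one row for one call. This helper only ever INSERTs."""
    row = {
        "call_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    row.setdefault(
        "total_tokens", row.get("input_tokens", 0) + row.get("output_tokens", 0)
    )
    # Column names come from keyword arguments written by the developer,
    # never from user input, so building the column list here is safe.
    columns = ", ".join(row)
    placeholders = ", ".join(f":{name}" for name in row)
    conn.execute(f"INSERT INTO llm_calls ({columns}) VALUES ({placeholders})", row)
    conn.commit()
    return row["call_id"]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
call_id = log_call(
    conn,
    agent_name="support-agent",
    model="example-model",
    finish_reason="length",
    input_tokens=1200,
    output_tokens=300,
    cost_usd=0.006,
    latency_ms=840,
    environment="dev",
)
print(conn.execute("SELECT COUNT(*) FROM llm_calls").fetchone()[0], "row logged")
```

The append-only property here is a convention of the helper, not something SQLite enforces. A stricter setup would restrict UPDATE and DELETE at the database permission level.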

§ 06

The Dashboard

Two layers. Neither requires a query language.

Layer one: the call feed. A chronological list of every LLM call with the five core fields visible without clicking through. Filterable by agent, prompt, model, finish reason, and environment. The purpose is forensic — when something behaves unexpectedly, you find the call and read exactly what happened.

Layer two: the aggregated view. Cost by agent, cost by prompt, token distribution over time, finish reason breakdown as a percentage. This is where patterns become visible: a prompt that consistently hits length, a model that costs three times more than expected for a specific task, an agent workflow where 15% of calls are ending on content_filter.
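The aggregated view reduces to a few group-by operations over the call log. A minimal sketch over in-memory records follows; the function names and the sample data are made up for illustration, and a real dashboard would run the equivalent queries against the log store.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def cost_by(calls: List[dict], key: str) -> Dict[str, float]:
    """Total cost grouped by a field such as agent_name or prompt_name."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    return dict(totals)


def finish_reason_breakdown(calls: List[dict]) -> Dict[str, float]:
    """Share of calls per finish reason, as a percentage."""
    counts = Counter(call["finish_reason"] for call in calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: 100.0 * n / total for reason, n in counts.items()}


# Made-up sample records carrying only the fields used here.
calls = [
    {"agent_name": "triage", "cost_usd": 0.5, "finish_reason": "stop"},
    {"agent_name": "triage", "cost_usd": 0.25, "finish_reason": "length"},
    {"agent_name": "writer", "cost_usd": 0.25, "finish_reason": "stop"},
    {"agent_name": "writer", "cost_usd": 0.0, "finish_reason": "stop"},
]

print(cost_by(calls, "agent_name"))
print(finish_reason_breakdown(calls))
```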

One honest caveat: this works at developer scale — dozens to low hundreds of calls per day. Once you are running hundreds of concurrent users, individual trace inspection becomes as overwhelming as raw logs. At that point you need automated evaluation layered on top — LLM-as-judge scoring, statistical assertions on finish reason distributions, cost anomaly detection. The dashboard becomes the drill-down tool when the automated layer flags something. It is not the whole answer at scale. It is the foundation everything else is built on.

§ 07

Alerts: What a Production Version Needs

A dashboard without alerts is passive. You only see problems when you happen to open it. A production observability system needs to tell you when something requires attention.

The four alert types that matter:

Cost threshold — notify when spend per agent, per prompt, or per day crosses a defined ceiling. Set this before you need it.

Finish reason rate — flag when length or content_filter frequency exceeds a threshold. A spike in length means a prompt is consistently hitting the token ceiling. A spike in content_filter means something in your inputs or outputs is triggering guardrails systematically.

Latency — flag calls that exceed acceptable response time. Latency spikes in LLM calls are often invisible until a user complains.

Error rate — notify when call failures spike above baseline.

Alerts are the difference between a monitoring tool and an operations tool.
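The four alert types can be sketched as one function evaluated over a window of recent calls. The threshold values, the `AlertThresholds` name, and the message formats below are assumptions chosen for illustration; sensible limits depend entirely on your workload.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertThresholds:
    # Illustrative defaults only; tune these for your own system.
    max_total_cost_usd: float = 50.0
    max_abnormal_finish_rate: float = 0.05
    max_latency_ms: int = 10_000
    max_error_rate: float = 0.02


def evaluate_alerts(calls: List[Dict], limits: AlertThresholds) -> List[str]:
    """Return one message per alert type that fired for this window."""
    if not calls:
        return []
    alerts = []
    n = len(calls)

    total_cost = sum(c["cost_usd"] for c in calls)
    if total_cost > limits.max_total_cost_usd:
        alerts.append(f"cost: ${total_cost:.2f} exceeds the ceiling")

    abnormal = sum(c["finish_reason"] in ("length", "content_filter") for c in calls)
    if abnormal / n > limits.max_abnormal_finish_rate:
        alerts.append(f"finish reason: {abnormal}/{n} calls ended abnormally")

    slow = sum(c["latency_ms"] > limits.max_latency_ms for c in calls)
    if slow:
        alerts.append(f"latency: {slow} call(s) over {limits.max_latency_ms} ms")

    errors = sum(c.get("error") is not None for c in calls)
    if errors / n > limits.max_error_rate:
        alerts.append(f"errors: {errors}/{n} calls failed")

    return alerts


window = [
    {"cost_usd": 0.5, "finish_reason": "stop", "latency_ms": 900, "error": None},
    {"cost_usd": 0.5, "finish_reason": "length", "latency_ms": 12_000, "error": None},
]
for message in evaluate_alerts(window, AlertThresholds(max_total_cost_usd=5.0)):
    print(message)
```

How the messages reach a person (email, chat, pager) is left out on purpose. The point of the sketch is that each alert is a simple predicate over fields the log already holds.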

§ 08

For the Enterprise: What You Should Be Demanding

Individual developers build this visibility because they need it to do their job. Enterprises need it for higher-stakes reasons: multiple agents, multiple teams, real budgets, and real accountability for what the systems do and what they cost.

What enterprise architects should require from every AI deployment:

Every agent writes to a centralised call log. No exceptions.

The log captures the five fields as a minimum.

The log is surfaced in a shared dashboard accessible to the team building and operating the product, not gated behind engineering access.

Alerts are configured before go-live, not after the first cost surprise.

What CIOs and CDOs should be asking:

Do we have call-level visibility into our AI systems? Not system-level uptime. Call-level. Can we answer, for any interaction, what it cost, what the model received, what it returned, and why it stopped? If the answer is no, you are running AI on trust rather than visibility.

The enterprise-grade version of this adds:

Role-based access to the dashboard.

Data retention policies aligned with compliance requirements.

Audit trails on prompt versions: which version was running when a specific call was made.

Integration with existing observability infrastructure so AI call data sits alongside the rest of your operational telemetry, not in a separate system no one checks.

The five fields do not change. The governance layer around them does.

AI without observability is not a production system. It is a black box running on your budget.

§ 09

Implementation: Where to Start

If you are an individual developer or small team:

Add Helicone to your stack today — change your API base URL, get your key, and you have call logging and cost tracking in under five minutes.

Evaluate whether you need prompt versioning and multi-step trace linking. If yes, move to Langfuse. Self-host it and your data stays in your infrastructure.

Configure finish reason as a visible column in your dashboard view from day one. Do not let it stay buried in metadata.

Set a cost alert threshold before you deploy to production.
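The first step above, routing calls through a logging proxy, might look roughly like this. Treat the proxy URL and the `Helicone-Auth` header as assumptions to verify against Helicone's current documentation. The helper name is hypothetical, and the sketch only builds the settings so that it runs without any SDK installed.

```python
import os

# Assumed proxy endpoint for OpenAI-compatible traffic; confirm it in
# Helicone's documentation before relying on it.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(helicone_api_key: str) -> dict:
    """Settings that point an OpenAI-style client at the logging proxy."""
    return {
        "base_url": HELICONE_OPENAI_BASE_URL,
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs(os.environ.get("HELICONE_API_KEY", "<your-key>"))
print(kwargs["base_url"])

# With the openai package installed, the settings would be passed as:
#   from openai import OpenAI
#   client = OpenAI(**kwargs)
```

Keeping the proxy settings in one function also makes them easy to switch off per environment, for example by returning an empty dict in local tests.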

If you are an enterprise team:

Decide on your observability tool before your first agent goes to production, not after. Retrofitting is harder than building in.

If you are already on Datadog or a similar enterprise observability stack that ingests OpenTelemetry data, evaluate Arize Phoenix. It is built on OpenTelemetry, so it fits the toolchain you already run.

Define your minimum log schema centrally. Every agent team should write the same fields in the same format. The dashboard is only as useful as the data it receives.
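A centrally defined schema is easiest to enforce when it ships as a small shared validator that every agent team calls before writing a row. The sketch below assumes a reduced set of required fields and a hypothetical `validate_record` helper; which fields your organisation makes mandatory is a separate decision.

```python
from typing import Any, Dict, List

# Assumed minimum: the five core fields plus the identifiers needed to group them.
REQUIRED_FIELDS: Dict[str, type] = {
    "agent_name": str,
    "model": str,
    "user_input": str,
    "output": str,
    "finish_reason": str,
    "input_tokens": int,
    "output_tokens": int,
    "cost_usd": float,
    "environment": str,
}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    env = record.get("environment")
    if isinstance(env, str) and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


good = {
    "agent_name": "triage", "model": "example-model", "user_input": "hi",
    "output": "hello", "finish_reason": "stop", "input_tokens": 3,
    "output_tokens": 2, "cost_usd": 0.00001, "environment": "dev",
}
bad = {k: v for k, v in good.items() if k != "model"}
bad["environment"] = "prod"

print(validate_record(good))
print(validate_record(bad))
```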

Appoint someone who is accountable for AI spend visibility. Not ownership of the tool — accountability for the number. Someone who looks at cost by agent every week and asks why it changed.

Add automated evaluation as soon as call volume makes manual inspection impractical. LLM-as-judge on a sample of outputs, statistical monitoring of finish reason distributions, anomaly detection on cost. The dashboard stays. Automation is what scales with it.
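Two of these automated pieces are small enough to sketch. The first flags a day's cost that sits far outside the recent baseline, and the second draws a reproducible sample of calls to hand to a reviewer or a judge model. The z-score rule, the threshold of 3, and both function names are assumptions for illustration; the LLM-as-judge step itself is not shown.

```python
import random
import statistics
from typing import List, Sequence


def is_cost_anomaly(
    history: Sequence[float], today: float, z_threshold: float = 3.0
) -> bool:
    """Compare today's cost with the mean and spread of previous days."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return today != mean
    return abs(today - mean) / spread > z_threshold


def sample_for_review(calls: Sequence[dict], k: int, seed: int = 0) -> List[dict]:
    """Pick up to k calls at random; a fixed seed keeps the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(calls), min(k, len(calls)))


daily_costs = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0]
print(is_cost_anomaly(daily_costs, today=40.0))
print(is_cost_anomaly(daily_costs, today=11.0))

logged = [{"call_id": i} for i in range(100)]
print(len(sample_for_review(logged, k=5)))
```

A z-score against a short history is a blunt instrument. It is enough to catch a runaway loop, and it can be replaced with something more robust once you know what normal variation looks like.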