Observability & monitoring
Offline golden evals prove your bot passed yesterday’s test suite. Observability proves it is still safe, faithful, and affordable in production today. This guide closes the eval loop: structured per-request logs, LangSmith and Langfuse traces, Helicone cost proxies, SLI dashboards tied to guardrail block rates, online eval sampling, drift detection, and the Track 5 observability checklist.
After reading, you should be able to: explain why offline evals are necessary but insufficient; emit a production-safe JSON log schema per LLM request; integrate LangSmith or Langfuse for traces and feedback; route inference through Helicone for cost and latency telemetry; define SLIs and alert thresholds for refusal rate, faithfulness, latency, errors, and cost; sample 1–5% of traffic for shadow scoring; promote production failures to golden datasets; and ship with the Track 5 observability production checklist.
Observability explained: why offline evals are not enough
Offline evals answer “did we pass the golden set we wrote?” Online observability answers “is production still healthy on traffic we never imagined?” This section contrasts offline vs online evaluation, maps the three pillars for LLM apps—traces, metrics, logs—and closes the eval loop from production failures back to gold.
Your refund copilot passes 200 golden cases in CI on Tuesday. Wednesday morning support volume doubles and answer quality craters, with the same deploy, the same prompt hash, and the same model string in config. Traditional APM shows green p99 latency and zero 5xx errors because every request returned HTTP 200 with a plausible paragraph. LLM observability exists because correctness is probabilistic, spend is per-token, and failure modes hide inside successful responses. Offline evals are necessary; they are not sufficient.
Offline vs online evaluation
| Dimension | Offline eval (CI / nightly) | Online observability (prod) |
|---|---|---|
| Traffic | Frozen golden JSONL | Live user queries, seasonal spikes, attacks |
| Latency | Batch, unbounded wall time | TTFT p95, queue depth, user-facing SLOs |
| Cost | Budgeted job; capped N cases | Per-request token spend; runaway agent loops |
| Correctness | Pass rate vs rubric | Sampled judges, thumbs, escalations, drift |
| Feedback loop | Human updates gold before merge | Auto-promote failures → gold within days |
| Guardrails | Adversarial gold in CI | Block-rate dashboards, injection spikes |
Three pillars for LLM apps
Standard observability stacks still apply—but LLM apps need pillar-specific fields. Traces waterfall retrieval, tool calls, guardrail stages, and generation spans under one trace_id. Metrics aggregate refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request. Logs emit structured JSON per request (next section) for audit, billing reconciliation, and incident replay without storing raw PII.
| Pillar | LLM-specific signal | Example tool |
|---|---|---|
| Traces | Span per retrieve / tool / generate / guardrail | LangSmith, Langfuse, OTel |
| Metrics | TTFT p95, tokens_in/out, cost_usd, block_rate | Prometheus, Datadog, Grafana |
| Logs | prompt_version, retrieval_ids, guardrail_decisions | JSON → Loki / CloudWatch / ELK |
flowchart LR
subgraph offline["Offline eval"]
G[Golden JSONL] --> CI[CI / nightly suite]
CI -->|pass| Ship[Deploy]
end
subgraph online["Online observability"]
Prod[Production traffic] --> Log[Structured logs + traces]
Log --> Dash[Dashboards + alerts]
Dash --> Sample[Sample 1-5% shadow scoring]
end
Sample -->|failure| Review[Human review queue]
Review --> G
Dash -->|drift alert| Rollback[Prompt rollback / feature flag]
Closing the eval loop: production → gold
Mature teams treat production surprises as dataset candidates, not one-off tickets. A faithfulness alert fires → on-call opens trace exemplars → PM confirms bad answer → row added to gold/refund_policy_v3.jsonl within 48h → CI blocks the next regression. Observability without this loop produces pretty dashboards that never change prompts.
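The promotion step above can be sketched as a small helper that appends a reviewed failure to a golden JSONL file and skips duplicates. The function name, the row fields (`query_sha256`, `expected`, `source_trace_id`), and the dedup rule are illustrative assumptions, not a schema from this guide; the query is assumed to be PII-scrubbed before it reaches this function.

```python
import hashlib
import json
from pathlib import Path


def promote_to_gold(gold_path: Path, query: str, expected: str, trace_id: str) -> bool:
    """Append a human-reviewed production failure to a golden JSONL file.

    Returns False when a row with the same normalized query already exists.
    """
    # Normalize whitespace and case so trivial variants dedupe to one row
    key = hashlib.sha256(" ".join(query.split()).lower().encode()).hexdigest()
    existing = set()
    if gold_path.exists():
        with gold_path.open() as fh:
            existing = {json.loads(line)["query_sha256"] for line in fh if line.strip()}
    if key in existing:
        return False
    row = {
        "query_sha256": key,
        "query": query,
        "expected": expected,
        "source_trace_id": trace_id,  # link back to the trace that exposed the failure
    }
    with gold_path.open("a") as fh:
        fh.write(json.dumps(row) + "\n")
    return True
```

Keeping the source trace ID on each row lets a reviewer jump from a failing CI case back to the production exemplar that motivated it.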
Non-deterministic outputs
Identical messages[], temperature 0, and model ID can still diverge across provider patches, batching order, and floating-point nondeterminism on some GPUs. You cannot assert response == expected in production—you score distributions, sample traces, and compare embedding centroids week over week. Offline evals give point estimates; online observability tracks drift of those estimates on live traffic slices.
- Same input, different answer: a user retries after a vague reply and gets contradictory policy guidance.
- Silent model swap: the provider repoints a floating alias such as gpt-4o to a newer snapshot while the model string in your config stays the same. Dated snapshots such as gpt-4o-2024-08-06 are pinned, which is why logging the resolved model matters.
- Tool nondeterminism: an agent calls search twice; the second result set changes the final refund decision.
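One way to surface the first failure mode is to group logged rows by prompt hash and count distinct response hashes. This is a minimal sketch; it assumes each log row carries the `messages_hash` and `response_hash` fields used elsewhere in this guide, and the function name is hypothetical.

```python
from collections import defaultdict


def divergent_prompts(rows: list[dict]) -> dict[str, int]:
    """Return prompt hashes that produced more than one distinct response hash."""
    seen: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        seen[row["messages_hash"]].add(row["response_hash"])
    # Keep only prompts whose answers actually diverged
    return {prompt: len(hashes) for prompt, hashes in seen.items() if len(hashes) > 1}
```

A rising count of divergent prompts at temperature 0 is a cheap hint to pull trace exemplars before users report contradictory answers.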
Latency variance (TTFT vs total)
LLM latency has two distinct components: time-to-first-token (TTFT) reflects queue depth and prompt size; total latency adds a variable decode length. A p99 total latency alert can fire during a marketing push not because GPUs failed but because users ask longer questions and models emit longer answers. Split SLIs into ttft_ms, decode_ms, and tokens_out; otherwise you chase the wrong knob (scaling replicas when you need max_tokens caps).
| Signal | Traditional API | LLM call | What to chart |
|---|---|---|---|
| Success | HTTP 2xx | HTTP 2xx + empty/refusal/hallucination | Semantic pass rate sample |
| Latency | Single wall time | TTFT + decode + tool round-trips | TTFT p50/p95, total p95 |
| Payload | Fixed JSON schema | Variable tokens in/out | Tokens per feature, context fill ratio |
| Errors | Exceptions, 5xx | Rate limits, context overflow, policy block | Typed error codes per vendor |
| Correctness | Deterministic tests | Judge scores, user thumbs, task success | Weekly drift vs golden centroid |
Cost unpredictability
A “simple” FAQ bot becomes expensive when retrieval returns 40 chunks, the model cites each, and a follow-up rewrites the whole thread into context. Cost scales with tokens in + tokens out + tool calls + judge evals, not QPS alone. Finance sees a 3× bill spike while traffic is flat—you need per-feature, per-tenant, per-model attribution in logs, not an aggregate CloudWatch billing alarm two days late.
Cost drivers to attribute per request
| Driver | Log field | Typical surprise |
|---|---|---|
| Prompt bloat | tokens_in | System prompt duplicated per turn |
| Runaway decode | tokens_out, finish_reason | Missing max_tokens on tool loop |
| Retry storm | attempt | 429 backoff not capped |
| Judge tax | eval_judge_tokens | 100% online LLM-as-judge in prod |
| Embedding refresh | embedding_tokens | Re-embed on every query |
Quality drift without deploys
Index rot, competitor docs changing, seasonal intents, and adversarial prompt patterns shift quality while git SHA is frozen. RAG retrieval recall drops when a PDF update splits tables across chunks—answers sound confident and wrong. Observability connects retrieval metrics (hit rate, MRR@k), generation metrics (faithfulness sample), and product metrics (escalation rate, CSAT) on shared request_id so you know which layer drifted.
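The two retrieval metrics named above can be computed as follows. This is a sketch under the assumption that you have labeled relevant chunk IDs for each query (for example from the golden set or a reviewed production sample); the function names are illustrative.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel & set(ids[:k]))
    return hits / len(retrieved)


def mrr_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant chunk within the top k (0 if none)."""
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids[:k], start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Tracking both per `retrieval_index_version` helps separate an index regression from a generation regression when quality drops without a deploy.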
import hashlib
import logging
from dataclasses import dataclass, asdict

logger = logging.getLogger("llm")


@dataclass
class LlmCallMeta:
    request_id: str
    model: str
    provider: str
    prompt_version: str
    temperature: float
    seed: int | None
    tokens_in: int
    tokens_out: int
    finish_reason: str
    response_hash: str  # first 16 hex chars of sha256 over normalized text


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so cosmetic differences hash identically
    return " ".join(text.split()).lower()


def record_completion(req_id: str, model: str, text: str, usage, **kwargs) -> LlmCallMeta:
    meta = LlmCallMeta(
        request_id=req_id,
        model=model,
        provider=kwargs.get("provider", "openai"),
        prompt_version=kwargs["prompt_version"],  # required: fail loudly if missing
        temperature=kwargs.get("temperature", 0.0),
        seed=kwargs.get("seed"),
        tokens_in=usage.prompt_tokens,
        tokens_out=usage.completion_tokens,
        finish_reason=kwargs.get("finish_reason", "stop"),
        response_hash=hashlib.sha256(normalize(text).encode()).hexdigest()[:16],
    )
    logger.info("llm_completion", extra=asdict(meta))
    return meta
public record LlmCallMeta(
String requestId, String model, String provider, String promptVersion,
double temperature, Integer seed, int tokensIn, int tokensOut,
String finishReason, String responseHash) {}
public LlmCallMeta recordCompletion(String reqId, String model, String text, Usage usage, Map kwargs) {
String hash = sha256(normalize(text)).substring(0, 16);
var meta = new LlmCallMeta(reqId, model, (String) kwargs.getOrDefault("provider", "openai"),
(String) kwargs.get("prompt_version"), (double) kwargs.getOrDefault("temperature", 0.0),
(Integer) kwargs.get("seed"), usage.promptTokens(), usage.completionTokens(),
(String) kwargs.getOrDefault("finish_reason", "stop"), hash);
structuredLog.info("llm_completion", meta);
return meta;
}
import time
from contextlib import contextmanager


@contextmanager
def llm_stream_timers(request_id: str):
    # `metrics` is assumed to be your StatsD/Datadog-style client
    marks = {"start": time.perf_counter(), "first_token": None}
    try:
        yield marks
    finally:
        # Emit timings even when the stream raises mid-response
        total_ms = (time.perf_counter() - marks["start"]) * 1000
        if marks["first_token"] is not None:
            ttft_ms = (marks["first_token"] - marks["start"]) * 1000
        else:
            ttft_ms = total_ms  # no token ever arrived
        decode_ms = total_ms - ttft_ms
        metrics.timing("llm.ttft_ms", ttft_ms, tags=[f"req:{request_id[:8]}"])
        metrics.timing("llm.decode_ms", decode_ms)
        metrics.timing("llm.total_ms", total_ms)

# In streaming handler:
#     marks["first_token"] = time.perf_counter()  # on first SSE chunk
public class LlmStreamTimers implements AutoCloseable {
private final long start = System.nanoTime();
private Long firstTokenNs = null;
public void onFirstToken() {
if (firstTokenNs == null) firstTokenNs = System.nanoTime();
}
@Override
public void close() {
long totalMs = (System.nanoTime() - start) / 1_000_000;
long ttftMs = firstTokenNs == null ? totalMs : (firstTokenNs - start) / 1_000_000;
metrics.timing("llm.ttft_ms", ttftMs);
metrics.timing("llm.decode_ms", totalMs - ttftMs);
metrics.timing("llm.total_ms", totalMs);
}
}
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class DriftMonitor:
    def __init__(self, baseline_vectors: np.ndarray, threshold: float = 0.92):
        self.centroid = baseline_vectors.mean(axis=0)
        self.threshold = threshold

    def score_batch(self, live_vectors: np.ndarray) -> dict:
        live_centroid = live_vectors.mean(axis=0)
        sim = float(cosine_similarity([self.centroid], [live_centroid])[0, 0])
        return {"centroid_similarity": sim, "drift_alert": sim < self.threshold}

# Run nightly on sampled answer embeddings; page if drift_alert.
public record DriftScore(double centroidSimilarity, boolean driftAlert) {}
public DriftScore scoreBatch(double[][] liveVectors, double[] baselineCentroid, double threshold) {
double[] liveCentroid = meanRows(liveVectors);
double sim = cosineSimilarity(baselineCentroid, liveCentroid);
return new DriftScore(sim, sim < threshold);
}
Avoid alerting on exact string match for regressions: it is flaky and meaningless with temperature > 0, and unreliable even at temperature 0. Use semantic similarity bands, judge scores, or task-success labels instead.
Provider TTFT includes queue time in shared multi-tenant clusters; your p95 TTFT spike may be vendor-side congestion, not your prompt doubling. Always log provider_request_id for support tickets.
"Why is LLM observability hard?" — Non-deterministic correctness, token-based cost, TTFT+decode latency split, silent model/index drift, failures masked as 200 OK. Mention trace spans per tool call and sampling for judges.
FAQ: Do we still need Datadog?
Yes—LLM traces complement APM. Keep JVM/Python service metrics in Datadog; export LLM spans to Langfuse or OpenTelemetry with gen_ai.* semantic conventions. Correlate via trace_id.
FAQ: Is temperature 0 enough for stability?
It reduces variance; it does not guarantee bitwise-identical outputs or protect against retrieval/index drift. Still log hashes and run online samples.
Logging schema: structured JSON per request
Structured logging is the foundation of LLM observability. Each completion emits one JSON row with trace_id, request_id, model, prompt_version, latency_ms, token counts, retrieval_ids, tool_calls, guardrail_decisions, and cost_usd—never raw PII.
Every LLM call—chat, embed, rerank, judge, moderation—should emit one structured log row and one trace span with the same request_id. If you cannot replay what the model saw, you cannot debug hallucinations, cost spikes, or guardrail false positives. Treat logs as a contract between app eng, ML, and platform: schema in git, enforced in code review, validated in staging.
Canonical fields per LLM call
| Field | Required? | Example | Why |
|---|---|---|---|
| request_id | Yes | req_8f2a | Join app, RAG, guardrails, support ticket |
| trace_id | Yes | OpenTelemetry hex | Cross-service waterfall |
| messages or hash | Yes | Redacted JSON or prompt_sha256 | Replay without storing raw PII |
| model | Yes | gpt-4o-2024-08-06 | Detect silent provider swaps |
| tokens_in/out | Yes | 1842 / 412 | Cost + context pressure |
| latency_ms | Yes | ttft: 420, total: 3100 | SLO dashboards |
| cost_usd | Yes | 0.0184 | Per-feature chargeback |
| retrieval_ids | RAG yes | ["chk_91","chk_44"] | Debug wrong citations |
| tool_calls | Agents yes | [{name, args_hash, latency_ms}] | Loop detection |
| guardrail_decisions | Yes | [{stage, action, dimension}] | Join Guide 4 block rates |
| metadata | Yes | tenant, feature, prompt_version | Slice drift alerts |
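The loop-detection use of `tool_calls` can be sketched as a count of repeated (name, args_hash) pairs within one request. The threshold of three repeats and the function name are assumptions chosen for illustration; tune the threshold per agent.

```python
from collections import Counter


def detect_tool_loop(tool_calls: list[dict], max_repeats: int = 3) -> list[tuple[str, str]]:
    """Return (name, args_hash) pairs called at least max_repeats times in one request."""
    counts = Counter((call["name"], call["args_hash"]) for call in tool_calls)
    return [key for key, n in counts.items() if n >= max_repeats]
```

Hashing the arguments rather than logging them keeps the check PII-safe while still distinguishing a genuine retry loop from varied calls to the same tool.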
Messages: store vs hash
Production rarely logs full prompts with customer PII. Pattern: log messages_redacted (Presidio scrubbed) in prod; log full payload in staging; always log prompt_version, system_prompt_hash, and context_chunk_ids[] so you can reconstruct from versioned stores when investigating incidents.
Redaction levels
| Environment | User content | Assistant content | Retrieval |
|---|---|---|---|
| Production | Mask PII; truncate >2k chars | Store hash + first 200 chars | chunk_id + score only |
| Staging | Full with synthetic data only | Full | Full chunk text |
| Debug (sampled) | Encrypted blob, 0.1% sample | Same | Same |
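The production row of the table can be sketched as mask, truncate, and hash. The two regexes below are a deliberately crude stand-in for a real PII detector such as Presidio, and the function and field names are hypothetical. Note that a plain hash of short, low-entropy text can be guessed by brute force, so a keyed hash may be preferable in practice.

```python
import hashlib
import re

# Crude illustrative patterns only; a production system would use a proper PII detector
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_user_content(text: str, max_chars: int = 2000) -> dict:
    """Mask obvious PII, truncate long content, and keep a hash of the original."""
    masked = EMAIL.sub("<EMAIL>", text)
    masked = CARD.sub("<CARD>", masked)
    return {
        "content_sha256": hashlib.sha256(text.encode()).hexdigest(),
        "content_redacted": masked[:max_chars],
        "truncated": len(masked) > max_chars,
    }
```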
Model and provider metadata
Log both model_requested and model_resolved after gateway routing. Bedrock and Azure often map friendly names to ARNs. Include provider_request_id for vendor support escalations during outage windows.
Tokens, latency, and cost
Compute cost_usd in-app from a versioned price table—do not rely on billing CSV latency. Break out cached_tokens when using prompt caching; subtract from marginal cost dashboards.
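A cache-aware cost function might look like the sketch below. It assumes `tokens_in` already includes `cached_tokens` (as in OpenAI-style usage objects) and that cached input tokens are billed at a separate, lower rate. The price values and the `PRICE_TABLE_VERSION` label are illustrative assumptions; check them against your provider's current price sheet.

```python
PRICE_TABLE_VERSION = "illustrative-v1"  # hypothetical label; version the table in git

# USD per 1M tokens; values are assumptions for illustration
PRICES_PER_MILLION = {
    "gpt-4o-2024-08-06": {"in": 2.50, "cached_in": 1.25, "out": 10.00},
}


def cost_usd_with_cache(model: str, tokens_in: int, tokens_out: int, cached_tokens: int = 0) -> float:
    """Estimate request cost, billing cached input tokens at the cached rate."""
    if cached_tokens > tokens_in:
        raise ValueError("cached_tokens cannot exceed tokens_in")
    p = PRICES_PER_MILLION[model]
    uncached = tokens_in - cached_tokens
    total = uncached * p["in"] + cached_tokens * p["cached_in"] + tokens_out * p["out"]
    return total / 1_000_000
```

Logging `PRICE_TABLE_VERSION` next to `cost_usd` makes it possible to recompute historical spend after a price change.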
import logging
from typing import Any

from pydantic import BaseModel, Field

logger = logging.getLogger("llm")


class LlmCallLog(BaseModel):
    request_id: str
    trace_id: str
    feature: str
    tenant_id: str
    prompt_version: str
    model_requested: str
    model_resolved: str
    provider: str
    tokens_in: int
    tokens_out: int
    cached_tokens: int = 0
    cost_usd: float
    latency_ms: dict[str, float]  # ttft, total, tools
    finish_reason: str
    messages_hash: str
    retrieval_ids: list[str] = Field(default_factory=list)
    tool_calls: list[dict[str, Any]] = Field(default_factory=list)
    guardrail_decisions: list[dict[str, str]] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)


def emit_log(log: LlmCallLog) -> None:
    # Never log full PII: messages_hash + redacted preview only
    logger.info("llm_call", extra=log.model_dump())
public record LlmCallLog(
String requestId, String traceId, String feature, String tenantId,
String promptVersion, String modelRequested, String modelResolved,
String provider, int tokensIn, int tokensOut, int cachedTokens,
double costUsd, Map latencyMs, String finishReason,
String messagesHash, Map metadata) {}
public void emitLog(LlmCallLog log) {
// Never log full PII — messagesHash + redacted preview only
structuredLog.info("llm_call", log);
}
from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")


def span_llm_call(model: str, messages_hash: str):
    # Returns a context manager: use as `with span_llm_call(...) as span:`
    return tracer.start_as_current_span(
        "chat.completions",
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.prompt.hash": messages_hash,
            "feature": "refund_copilot",
        },
    )

# After response:
# span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
# span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
public Span spanLlmCall(String model, String messagesHash) {
return tracer.spanBuilder("chat.completions")
.setAttribute("gen_ai.system", "openai")
.setAttribute("gen_ai.request.model", model)
.setAttribute("gen_ai.prompt.hash", messagesHash)
.setAttribute("feature", "refund_copilot")
.startSpan();
}
PRICES = {
    "gpt-4o-2024-08-06": {"in": 2.50, "out": 10.00},  # USD per 1M tokens
    "text-embedding-3-small": {"in": 0.02, "out": 0.0},
}


def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000


def enrich_with_cost(log: dict) -> dict:
    log["cost_usd"] = round(cost_usd(log["model"], log["tokens_in"], log["tokens_out"]), 6)
    return log
public double costUsd(String model, int tokensIn, int tokensOut) {
var p = priceTable.get(model);
return (tokensIn * p.inputPerMillion() + tokensOut * p.outputPerMillion()) / 1_000_000.0;
}
Standardize on one metadata.feature enum—refund_copilot, ticket_summarizer—before you have 40 microservices logging service_name inconsistently.
Never log API keys, raw OAuth tokens, or full PAN/SSN—even in ‘temporary’ debug flags. Use hash + encrypted object storage with TTL for sampled full prompts.
Full message logging aids debug but increases GDPR surface and storage cost. Hash + versioned prompt registry + chunk IDs captures 90% of incidents with 10% of the risk.
Metadata per call: minimum viable tags
- tenant_id — multi-tenant cost and quality isolation
- prompt_version — git tag or registry semver
- retrieval_index_version — vector store snapshot
- guardrail_actions[] — from Guide 4 traces
- user_locale — slice drift detection
FAQ: Log embeddings?
Log embedding_model, tokens_in, and query hash—not the 1536-d vector unless debugging index issues in staging.
LangSmith & Langfuse: traces, feedback, datasets
LangSmith (LangChain-native) and Langfuse (open-source, self-hostable) are the two most common LLM trace hubs. Both capture runs, attach feedback, link datasets to failures, and support prompt versioning—pick based on LangChain coupling and data residency.
Traces without feedback are archaeology. Production observability needs a place to store run trees, human thumbs, judge scores, and the golden rows those failures become. LangSmith and Langfuse both close that loop; they differ on deployment model, LangChain depth, and self-host flexibility.
LangSmith vs Langfuse
| Dimension | LangSmith | Langfuse |
|---|---|---|
| Deployment | Cloud (LangChain); self-hosting offered on enterprise plans | Cloud or self-host (Docker/K8s) |
| LangChain integration | Automatic run trees | SDK decorators; works without LC |
| Runs & spans | Runs, child steps, playground replay | Traces, generations, sessions |
| Feedback | Human + API scores on runs | Scores linked to traces |
| Datasets | First-class; trace → example | Datasets + prompt registry |
| Best for | LangGraph monoliths, fast iteration | Regulated industries, multi-stack |
| Watch out | Vendor lock-in; trace volume cost | Self-host ops (Postgres + ClickHouse) |
LangSmith
Deepest integration with LangChain: automatic run trees, playground from traces, datasets linked to failures. Strong for teams already on LangGraph. Price scales with trace volume; use sampling in high-QPS chat.
Langfuse
Open-source trace UI with sessions, generations, scores, and prompt versions. Self-host for regulated industries; cloud for speed. Python/JS SDKs wrap manual spans without LangChain.
When to pick which
| Situation | Pick | Why |
|---|---|---|
| LangChain / LangGraph monolith | LangSmith | Zero-config run trees + dataset export |
| Must self-host traces (HIPAA, EU) | Langfuse OSS | Data stays in your VPC |
| Polyglot stack (Java + Python) | Langfuse SDK | Framework-agnostic decorators |
| Need traces + CI eval in one UI | LangSmith | Datasets linked to failed runs |
| Cost/latency only, week-one bootstrap | Helicone (next section) | Proxy, not full trace hub |
import os

from langchain_core.tracers.context import collect_runs
from langchain_openai import ChatOpenAI
from langsmith import Client

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "refund-copilot"

client = Client()
llm = ChatOpenAI(model="gpt-4o-mini")

# Runs auto-export when LANGCHAIN_TRACING_V2=true.
# collect_runs() captures the run ID, which is not part of response_metadata.
with collect_runs() as cb:
    result = llm.invoke("What is the refund window?")
run_id = cb.traced_runs[0].id

# Attach human feedback to the run
client.create_feedback(
    run_id=run_id,
    key="helpfulness",
    score=0.85,
    comment="Correct policy, missing link",
)
// LangSmith REST — manual run log from non-LangChain Java service
HttpRequest req = HttpRequest.newBuilder()
.uri(URI.create("https://api.smith.langchain.com/runs"))
.header("x-api-key", langSmithKey)
.POST(HttpRequest.BodyPublishers.ofString(runJson))
.build();
// POST feedback: /feedback with run_id + score
# Langfuse Python SDK v2 decorator API
from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe

langfuse = Langfuse()


@observe()
def answer_refund_question(request_id: str, query: str) -> str:
    # retrieve, llm, and build_messages are your application's own helpers
    langfuse_context.update_current_trace(metadata={"request_id": request_id})
    chunks = retrieve(query)  # nest @observe on retrieve too
    response = llm.complete(build_messages(chunks, query))
    langfuse_context.score_current_trace(name="helpfulness", value=0.9)
    return response
// Langfuse Java SDK — manual span
LangfuseClient client = LangfuseClient.create(publicKey, secretKey);
client.trace(new TraceBody("refund-" + requestId)
.metadata(Map.of("feature", "refund_copilot")));
// Log generation with model, usage, output
Support platform team exported LangSmith failed runs to JSONL nightly—same rows became CI gold in Guide 6 within one sprint.
LangSmith pricing tracks trace volume, so sample 5–10% of prod chat and trace 100% of staging. Self-hosting Langfuse trades license fees for roughly $800–2k/mo of infrastructure.
Set LANGCHAIN_PROJECT per feature (refund, billing, internal)—prevents trace UI noise and simplifies dataset export.
FAQ: LangSmith without LangChain?
Possible via REST/SDK manual runs, but awkward—pick Langfuse if not on LangChain.
Helicone: cost and latency proxy
Helicone sits as an OpenAI-compatible reverse proxy—swap base_url, get request logging, cost attribution, latency histograms, and caching headers in minutes. Complement (not replace) LangSmith/Langfuse when you need fast cost telemetry without full RAG span instrumentation.
Helicone sees every proxied completion: model, tokens, latency, status, and custom properties via headers. It cannot see retrieval or guardrail logic that runs before the proxy call—use it for cost/latency SLIs and budget alerts while SDK tracing captures the full waterfall.
Helicone capabilities
| Feature | How | Use case |
|---|---|---|
| Request logging | Proxy all OpenAI traffic | Instant cost dashboard |
| Custom properties | Helicone-Property-* headers | Slice by tenant, feature, prompt_version |
| Response caching | Helicone-Cache-Enabled | Repeat FAQ queries; watch stale policy risk |
| Budget alerts | Helicone dashboard + webhooks | Page when daily spend > forecast |
| Latency tracking | Per-request TTFT + total | Provider incident detection |
from openai import OpenAI
import os
client = OpenAI(
api_key=os.environ["OPENAI_API_KEY"],
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
"Helicone-Property-Feature": "refund_copilot",
"Helicone-Property-Tenant": tenant_id,
"Helicone-Property-Prompt-Version": prompt_version,
},
)
OpenAIClient client = OpenAIOkHttpClient.builder()
.baseUrl("https://oai.helicone.ai/v1")
.addHeader("Helicone-Auth", "Bearer " + heliconeKey)
.addHeader("Helicone-Property-Feature", "refund_copilot")
.addHeader("Helicone-Property-Tenant", tenantId)
.addHeader("Helicone-Property-Prompt-Version", promptVersion)
.apiKey(openAiKey)
.build();
Caching headers
Enable Helicone cache for idempotent FAQ lookups—pair with prompt_version in cache key via property headers. Invalidate cache on policy deploy, not just model deploy.
headers = {
"Helicone-Auth": f"Bearer {helicone_key}",
"Helicone-Cache-Enabled": "true",
"Helicone-Cache-Bucket-Max-Size": "1000",
"Helicone-Property-Prompt-Version": "refund_v12",
}
# Only cache temperature=0, no tool calls, no user-specific context
.addHeader("Helicone-Cache-Enabled", "true")
.addHeader("Helicone-Cache-Bucket-Max-Size", "1000")
.addHeader("Helicone-Property-Prompt-Version", "refund_v12")
Budget alerts
Configure Helicone webhooks for daily spend thresholds; forward to PagerDuty for hard caps and Slack for soft warnings. Mirror the same thresholds in your internal cost_usd logs so Helicone and app metrics agree.
Helicone proxy is fastest to deploy but blind to pre-proxy retrieval and guardrail spans. Use alongside Langfuse/LangSmith for full waterfall.
Avoid caching policy answers without prompt_version in the cache key: users get stale refund rules after a deploy.
Helicone free tier covers early prod; paid tiers scale with logged requests. ROI is catching one runaway agent loop before finance notices.
Dashboards & alerts: SLIs and thresholds
Production dashboards track five core SLIs: refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request—plus guardrail block rate from Guide 4. Wire alert thresholds in Grafana or Datadog with runbooks, not Slack noise.
SLIs for LLM apps mirror microservices (latency, errors, saturation) and add two semantic signals: quality and cost. Platform SRE owns TTFT and error budgets; product ML owns sampled judge scores and hallucination rates. Define SLOs per feature, not globally: a ticket summarizer can tolerate 8s p95 while a checkout copilot targets 2s TTFT p95.
Core SLIs at a glance
| SLI | Definition | Typical threshold | Source |
|---|---|---|---|
| Refusal rate | Guardrail + policy blocks / total requests | 2–8% baseline; alert if 2× spike | guardrail_decisions |
| Faithfulness proxy | Sampled NLI/judge pass rate | ≥88% weekly; alert if −8pts | Online eval sample |
| P95 latency | TTFT + total wall time | TTFT <800ms, total <4s | Helicone / app metrics |
| Error rate | 429, 5xx, overflow, empty | <1% 5xx; <5% 429 | Typed error_class |
| Cost / request | cost_usd p50/p99 | Alert if p99 3× baseline 1h | Structured logs |
| Guardrail block rate | Blocks by dimension (Guide 4) | Injection spike → SecOps | dimensions_flagged |
TTFT p95 and total latency
| SLI | Definition | Typical SLO (chat) | Alert |
|---|---|---|---|
| TTFT p95 | First token − request start | < 800 ms | Page if >1.2s 15m |
| Total p95 | Last token − request start | < 4 s | Ticket if >6s 1h |
| Stream stall | Gap between chunks >2s | < 0.5% requests | Warn |
| Tool round-trip | Per tool span | < 1.5s p95 | Per dependency |
Error rate (typed)
Split 429, context_length_exceeded, guardrail blocks, empty completions, and JSON parse failures—lumping into “5xx rate” hides fixable product bugs.
| Error class | Owner | Mitigation |
|---|---|---|
| Rate limit 429 | Platform | Token bucket, queue, model fallback |
| Context overflow | App eng | Summarize thread, trim retrieval |
| Guardrail block | Safety | Expected; track false positive sample |
| Invalid JSON tool args | App eng | Repair prompt, tighten schema |
Token and cost per feature
Dashboard: sum(cost_usd) by feature / DAU, plus tokens_in p95 as early warning for prompt bloat. Anomaly detection on daily spend per tenant catches infinite agent loops before finance does.
Hallucination sampling
You cannot run NLI faithfulness on 100% of prod traffic. Sample 1–5% stratified by locale and intent; run lightweight judge (gpt-4o-mini with retrieved-only context); store scores in trace UI. Trend weekly; page on 10pt drop vs rolling baseline.
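Stratified sampling can be made deterministic by hashing the request ID, so the same request is always in or out of the sample regardless of which worker handles it. This is a sketch; the per-stratum rate table and the function name are hypothetical.

```python
import hashlib


def should_sample(request_id: str, locale: str, intent: str, rates: dict) -> bool:
    """Decide deterministically whether to shadow-score this request.

    rates maps (locale, intent) tuples to a sample rate, with a "default" fallback.
    """
    rate = rates.get((locale, intent), rates["default"])
    digest = hashlib.sha256(request_id.encode()).digest()
    # First 4 bytes as an integer, scaled to a value in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

For example, `rates = {"default": 0.03, ("de-DE", "refund"): 0.05}` oversamples a low-volume stratum so its weekly score is not dominated by noise.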
from prometheus_client import Histogram, Counter

TTFT = Histogram("llm_ttft_seconds", "Time to first token", ["model", "feature"])
ERRORS = Counter("llm_errors_total", "LLM errors", ["error_class", "feature"])
COST = Counter("llm_cost_usd_total", "Estimated USD", ["feature", "tenant"])


def observe_call(meta: dict):
    TTFT.labels(meta["model"], meta["feature"]).observe(meta["latency_ms"]["ttft"] / 1000)
    COST.labels(meta["feature"], meta["tenant_id"]).inc(meta["cost_usd"])
    if meta.get("error_class"):
        ERRORS.labels(meta["error_class"], meta["feature"]).inc()
meterRegistry.summary("llm.ttft.seconds", "model", model, "feature", feature)
.record(ttftMs / 1000.0);
meterRegistry.counter("llm.cost.usd.total", "feature", feature, "tenant", tenantId)
.increment(costUsd);
import random

from faithfulness import nli_entailment  # your wrapper

SAMPLE_RATE = 0.03


def maybe_score_faithfulness(request_id: str, answer: str, chunks: list[str]) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = nli_entailment(answer, chunks)  # 0..1
    traces.log_score(request_id, name="faithfulness", value=score)
    metrics.gauge("llm.faithfulness.sample", score, tags=[f"req:{request_id[:8]}"])
public void maybeScoreFaithfulness(String requestId, String answer, List chunks) {
if (ThreadLocalRandom.current().nextDouble() > SAMPLE_RATE) return;
double score = faithfulnessClient.nliEntailment(answer, chunks);
traces.logScore(requestId, "faithfulness", score);
}
# Pseudocode for multi-window burn alert
# SLO: TTFT p95 < 800ms over 30d
# Fast burn: 14.4x budget in 1h → page
def ttft_burn_rate(window_minutes: int) -> float:
    p95 = metrics.p95("llm.ttft_ms", window=f"{window_minutes}m")
    slo_ms = 800
    error_budget_consumed = max(0, p95 - slo_ms) / slo_ms
    return error_budget_consumed / (window_minutes / (30 * 24 * 60))


if ttft_burn_rate(60) > 14.4:
    pager.trigger("LLM TTFT SLO fast burn")
double ttftBurnRate(int windowMinutes) {
double p95 = metrics.p95("llm.ttft_ms", windowMinutes);
double errorBudget = Math.max(0, p95 - 800) / 800.0;
return errorBudget / (windowMinutes / (30.0 * 24 * 60));
}
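An alternative, request-based formulation of burn rate, common in SRE practice, divides the observed fraction of bad requests by the error budget. The sketch below assumes the SLO is restated as "95% of requests have TTFT under 800 ms" and that you can count slow and total requests per window; the function names and the two-window rule are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed bad fraction divided by the error budget (1 - slo_target)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo_target: float = 0.95, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    Each window is a (slow_requests, total_requests) pair.
    """
    return all(
        burn_rate(bad, total, slo_target) > threshold
        for bad, total in (long_window, short_window)
    )
```

Requiring the short window to agree with the long one stops the page from firing long after a brief spike has already recovered.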
Plot tokens_out vs total_latency_ms—linear correlation confirms decode-bound slowness; flat line implicates network or queue.
Avoid averaging faithfulness scores without volume weighting: in an unweighted average of daily means, a 0.2 score on 2 samples moves the weekly chart as much as a day with 10k samples.
Histogram buckets for TTFT should start at 50ms, not 1s: much of the TTFT distribution can sit below 200ms, with a long tail and a secondary bump from queueing, and coarse buckets hide both.
Dashboard row (minimum)
- TTFT p50/p95 by model and feature
- Error rate by error_class
- Daily cost_usd by tenant (top 10)
- Faithfulness sample 7-day rolling avg
- Guardrail block rate (from Guide 4)
Grafana / Datadog sketch
Export Prometheus metrics from your log pipeline or OTel collector. Typical dashboard rows: TTFT heatmap by model, refusal rate stacked by guardrail dimension, faithfulness 7-day rolling avg, cost/tenant top-N, error class breakdown. Datadog: use gen_ai.* facets if using OpenLLMetry; correlate with APM service map.
Link guardrail block-rate panel to Guide 4 dimensions—injection spikes route to SecOps; PII blocks route to privacy oncall. One “block rate” metric hides which rail failed.
Alert thresholds
LLM alerts fall into four buckets: errors (hard failures), latency (SLO burn), cost anomalies (spend without traffic), and quality (judge score drift, escalation spikes). Each needs a runbook link, an owner, and explicit “expected noise” for guardrail blocks during attacks.
Error alerts
| Condition | Severity | Runbook action |
|---|---|---|
| 429 rate >5% 5m | Page | Enable queue; switch fallback model |
| 5xx from provider >1% 5m | Page | Status page; regional failover |
| Context overflow >10% 15m | Ticket | Trim retrieval; thread summarization |
| Empty completion >0.5% | Warn | Check filter/harm blocks misfiring |
Latency alerts
Use multi-window burn rates on TTFT and total latency. Pair them with tokens_out p95: if a latency alert fires without a token spike, suspect a provider incident, not your deploy.
Cost anomaly alerts
Forecast daily spend with a 7-day seasonal baseline; alert when actual spend exceeds the forecast by more than 2σ or is above 120% of it. Slice by tenant_id to catch one customer’s agent loop, not fleet-wide noise.
| Signal (threshold) | Window | Likely cause |
|---|---|---|
| Cost/req p99 3× baseline | 1 hour | Runaway tool loop, missing max_tokens |
| Embedding cost 2× | 24 h | Re-embed cron duplicated |
| Judge token line item 5× | 24 h | Sample rate config typo (0.3 vs 0.03) |
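One way to build the 7-day seasonal baseline is to average the same weekday over the previous weeks. This is a sketch under that assumption; `seasonal_forecast` and `cost_anomaly` are hypothetical helper names, and `history` is assumed to be daily cost ordered oldest to newest, ending yesterday.

```python
from statistics import mean


def seasonal_forecast(history: list[float], period: int = 7, weeks: int = 4) -> float:
    """Forecast today's spend as the mean of the same weekday in prior weeks."""
    same_day = [history[-period * k] for k in range(1, weeks + 1)
                if period * k <= len(history)]
    if not same_day:
        raise ValueError("need at least one full period of history")
    return mean(same_day)


def cost_anomaly(actual: float, history: list[float], ratio: float = 1.2) -> bool:
    """True when actual spend is above ratio x the seasonal forecast (120% by default)."""
    return actual > ratio * seasonal_forecast(history, period=7)
```

Comparing against the same weekday avoids flagging every Monday as an anomaly when weekend traffic is routinely lower.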
Quality alerts
Quality signals lag—use 24h windows. Alert on faithfulness sample drop >8pts vs 7d baseline, CSAT drop correlated with model version tag, or support escalation keyword rate spike (“wrong policy”, “made up”).
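```python
# Expressions are PromQL-style sketches. llm_requests is an assumed request counter,
# used so the 429 alert compares a ratio (5%) rather than a raw per-second rate.
ALERTS = [
    {"name": "llm_429_spike", "expr": "rate(llm_errors{class='429'}[5m]) / rate(llm_requests[5m]) > 0.05", "severity": "page"},
    {"name": "llm_cost_anomaly", "expr": "daily_cost > forecast * 1.2", "severity": "ticket"},
    {"name": "faithfulness_drop", "expr": "faithfulness_7d_avg < baseline - 0.08", "severity": "ticket"},
]


def route(alert: dict) -> None:
    # pagerduty and slack are assumed to be pre-configured client objects.
    if alert["severity"] == "page":
        pagerduty.trigger(alert["name"], runbook="wiki/llm-incident")
    else:
        slack.post("#llm-ops", alert["name"])
```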
```java
public void route(Alert alert) {
    if ("page".equals(alert.severity())) {
        pagerduty.trigger(alert.name(), "wiki/llm-incident");
    } else {
        slack.post("#llm-ops", alert.name());
    }
}
```
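```python
import statistics as stats

# load_daily_cost, current_daily_cost, and fire_alert are assumed helpers from your metrics pipeline.
history = load_daily_cost(feature="refund_copilot", days=28)
today = current_daily_cost()
mu = stats.mean(history)
sigma = stats.pstdev(history) or 1.0  # fall back to 1.0 when the history is flat
z = (today - mu) / sigma
if z > 2.5:
    fire_alert("cost_anomaly", {"z": z, "today_usd": today, "mean_usd": mu})
```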
```java
double mu = history.stream().mapToDouble(d -> d).average().orElse(0);
double sigma = stdDev(history);
double z = (today - mu) / Math.max(sigma, 1e-6);
if (z > 2.5) fireAlert("cost_anomaly", Map.of("z", z));
```
```python
def check_quality_regression():
    for version in active_prompt_versions():
        score = metrics.avg("faithfulness", filter={"prompt_version": version}, window="24h")
        baseline = baselines.get(version)
        if baseline is None:
            continue  # no stored baseline yet for this version; nothing to compare against
        if score < baseline - 0.08:
            create_incident(
                title=f"Quality drop on {version}",
                owner="ml-oncall",
                links=[f"langfuse://prompts/{version}"],
            )
```
```java
void checkQualityRegression() {
    for (String version : activePromptVersions()) {
        double score = metrics.avg("faithfulness", Map.of("prompt_version", version), "24h");
        double baseline = baselines.get(version);
        if (score < baseline - 0.08) createIncident(version, score);
    }
}
```
Wire guardrail block spikes to SecOps when dimension includes injection or PII—not all blocks belong in #llm-ops noise.
A cost anomaly alert caught a deploy that set max_tool_iterations=50 as the default: the support bot burned $12k overnight calling search in a loop.
"What do you alert on for LLM prod?" — Typed errors, TTFT SLO burn, cost/req anomalies, sampled quality drift—not BLEU on live traffic.
Runbook template
- Acknowledge; check provider status and recent deploys
- Compare TTFT vs tokens_out; open trace exemplars
- Roll back prompt_version or enable fallback model
- Post-mortem: add failing trace to golden set (Guide 6 CI)
Online evals: sample, score, promote to gold
Run online evals on 1–5% of production traffic: shadow scoring with rules and lightweight judges, a human review queue for borderline cases, automatic promotion of failures to the golden dataset, and week-over-week drift detection.
Offline CI cannot cover every user phrasing, locale, or adversarial pattern. Online eval closes the gap without bankrupting inference—sample stratified traffic, score asynchronously, and feed failures back to gold within days.
Sample 1–5% production traffic
| Strategy | Rate | When |
|---|---|---|
| Head-based random | 1–3% | Steady-state chat |
| Stratified by locale/intent | 5% on thin slices | Bias audit, new market launch |
| 100% on error/guardrail block | All failures | Always—cheap signal |
| 100% staging | All | Pre-prod prompt experiments |
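The sampling strategies in the table can be combined in one deterministic decision function. This is a sketch: hashing the request ID (rather than calling a random generator) is a design choice assumed here so that every service makes the same keep/drop decision for a given request; the slice names and rates in `SLICE_RATES` are hypothetical.

```python
import hashlib

# Illustrative per-slice rates; slice keys are hypothetical.
SLICE_RATES = {"default": 0.03, "locale:pt-BR": 0.05}


def sample_fraction(request_id: str) -> float:
    """Map a request ID to a stable value between 0 and 1."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_sample(request_id: str, slice_key: str = "default",
                  failed: bool = False) -> bool:
    """Keep every failure; otherwise sample at the rate configured for the slice."""
    if failed:  # errors and guardrail blocks are always kept
        return True
    rate = SLICE_RATES.get(slice_key, SLICE_RATES["default"])
    return sample_fraction(request_id) < rate
```

Because the decision depends only on the request ID, replaying a log produces the same sample, which makes online eval results reproducible.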
Shadow scoring with rules
Shadow path runs after the user sees the response—never block on online judge latency. Apply cheap rules first: regex policy phrases, citation ID match against retrieval_ids, JSON schema re-validation, max length. Escalate to LLM-as-judge only when rules are inconclusive.
```python
import random
from queue import Queue

# In-process stand-in for the sketch; use a durable queue (SQS/Kafka) in production.
shadow_queue: Queue = Queue()


def maybe_enqueue_shadow(log: dict) -> None:
    if log.get("guardrail_decisions") and any(d["action"] == "block" for d in log["guardrail_decisions"]):
        shadow_queue.put(log)  # 100% on blocks
        return
    if random.random() < 0.03:  # 3% sample
        shadow_queue.put(log)


def shadow_worker():
    while True:
        log = shadow_queue.get()
        rules = run_rule_checks(log)  # citation match, banned phrases
        if rules["passed"] is None:  # inconclusive
            rules["judge_score"] = llm_judge(log)
        store_online_eval_result(log["request_id"], rules)
```
```java
void maybeEnqueueShadow(LlmCallLog log) {
    boolean blocked = log.guardrailDecisions().stream()
        .anyMatch(d -> "block".equals(d.get("action")));
    if (blocked || ThreadLocalRandom.current().nextDouble() < 0.03) {
        shadowQueue.offer(log);
    }
}
```
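The rule stage called by the shadow worker could look roughly like this. It is a sketch, not the guide's implementation: the `output` log field, the `[doc:ID]` citation format, the banned phrases, and the length cap are all assumptions; only `retrieval_ids` and the three-valued `passed` result come from the text above.

```python
import re

BANNED = re.compile(r"\b(guaranteed refund|legal advice)\b", re.IGNORECASE)  # hypothetical phrases
CITATION = re.compile(r"\[doc:([\w-]+)\]")  # assumed citation format, e.g. [doc:policy-12]
MAX_CHARS = 4000  # assumed length cap


def run_rule_checks(log: dict) -> dict:
    """Return passed=True/False when rules decide, None when they are inconclusive."""
    output = log.get("output", "")
    reasons = []
    if BANNED.search(output):
        reasons.append("banned_phrase")
    if len(output) > MAX_CHARS:
        reasons.append("too_long")
    cited = set(CITATION.findall(output))
    unknown = cited - set(log.get("retrieval_ids", []))
    if unknown:
        reasons.append("citation_not_retrieved")
    if reasons:
        return {"passed": False, "reasons": reasons}
    if not cited:
        # No citations to verify: rules cannot confirm grounding, so defer to the judge.
        return {"passed": None, "reasons": ["no_citations"]}
    return {"passed": True, "reasons": []}
```

Only the `None` case costs judge tokens, so the cheaper the rules can settle a case, the smaller the judge bill.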
Human review queue
Route low-confidence shadow scores (judge 0.4–0.6, rule conflicts, user thumbs-down) to a review UI in LangSmith or Langfuse. Reviewers label pass/fail with rubric notes; one click promotes to golden JSONL with source=production and trace_id link.
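A sketch of that routing decision. The result dictionary shape (`passed`, `judge_score`), the `"thumbs_down"` feedback value, and the 0.5 cut-off used to detect a rule/judge disagreement are assumptions for illustration.

```python
from typing import Optional


def needs_human_review(result: dict, user_feedback: Optional[str] = None) -> bool:
    """Route borderline or contested online eval results to the human queue."""
    if user_feedback == "thumbs_down":
        return True
    score = result.get("judge_score")
    if score is not None and 0.4 <= score <= 0.6:
        return True  # low-confidence judge band
    passed = result.get("passed")
    if passed is not None and score is not None:
        # Rule conflict: rules and judge disagree about pass/fail.
        return passed != (score >= 0.5)
    return False
```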
Promote failures to golden dataset
- Shadow score fails or human marks fail
- Export row: input, expected policy, actual output, retrieval_ids, prompt_version
- PR to evals/gold/ with reviewer sign-off
- CI runs on merge—failure becomes regression gate in Guide 6
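The export step above can be sketched as a single function that emits one JSONL line per reviewed failure. The log field names `input`, `output`, and `trace_id`, and the `reviewer` field, are assumptions about your schema; `source=production` and the listed export fields come from the text.

```python
import json


def to_gold_row(log: dict, expected_policy: str, reviewer: str) -> str:
    """Serialize one reviewed production failure as a golden JSONL line."""
    row = {
        "input": log["input"],
        "expected_policy": expected_policy,
        "actual_output": log["output"],
        "retrieval_ids": log.get("retrieval_ids", []),
        "prompt_version": log["prompt_version"],
        "source": "production",
        "trace_id": log["trace_id"],
        "reviewer": reviewer,
    }
    # sort_keys keeps diffs stable in the PR to evals/gold/
    return json.dumps(row, ensure_ascii=False, sort_keys=True)
```

Redact PII from `input` and `actual_output` before the row reaches git; golden files tend to outlive trace retention.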
Drift detection week-over-week
Compare weekly aggregates: faithfulness sample mean, refusal rate by dimension, embedding centroid similarity of answers, CSAT on bot-handled tickets. Alert when any metric moves >2σ vs four-week baseline—not on single-day noise.
| Drift signal | Window | Action |
|---|---|---|
| Faithfulness −8pts | 7d vs prior 7d | Freeze prompt merges; open incident |
| Refusal rate 2× | 24h | Check attack traffic vs false positives |
| Answer embedding centroid | Weekly | Index rot or model swap investigation |
| Escalation keyword rate | 7d | Product review; add gold rows |
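A minimal sketch of the 2σ rule against a four-week baseline, using the sample standard deviation. The `drifted` helper and its `min_sigma` floor are assumptions; with only four baseline points the σ estimate is noisy, so treat the result as a prompt to investigate rather than proof.

```python
from statistics import mean, stdev


def drifted(baseline_weeks: list[float], current_week: float,
            sigmas: float = 2.0, min_sigma: float = 1e-9) -> bool:
    """True when the current weekly value is more than `sigmas` std devs from the baseline mean."""
    if len(baseline_weeks) < 2:
        raise ValueError("need at least two baseline weeks")
    mu = mean(baseline_weeks)
    sd = max(stdev(baseline_weeks), min_sigma)  # floor avoids division by zero on a flat baseline
    return abs(current_week - mu) / sd > sigmas
```

For example, `drifted([0.90, 0.91, 0.89, 0.90], 0.82)` flags an 8-point faithfulness drop, while a move to 0.905 stays inside the band.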
Tag every online eval row with prompt_version and model_resolved—drill-down without guessing which deploy caused drift.
Do not run LLM-as-judge on 100% of production traffic: it doubles inference cost and adds judge bias. Rules first, judge on a sample, humans on borderline cases.
"Online vs offline eval?" — Offline blocks bad merges; online detects drift on live traffic; failures promote to gold; closed loop in days not quarters.
Shadow queue should be durable (SQS/Kafka)—dropping samples during traffic spikes hides the exact failures you need most.
Track 5 observability production checklist
Ship when structured logs, traces, dashboards, and alerts pass the Track 5 observability checklist—then connect exemplar failures to CI eval in the next guide.
Guide 5 closes when traces, dashboards, and alerts match the Track 5 observability checklist below—then wire regression gates in Guide 6 CI. Observability without CI regression is reactive; CI without production telemetry is blind to index rot and provider drift.
Track 5 observability checklist
| Item | Done when | Owner |
|---|---|---|
| Canonical LlmCallLog schema in git | Code review enforced | Platform |
| Trace tool chosen + OTel export | Staging 100% sampled | Platform |
| TTFT + total latency dashboards | Per feature SLO lines drawn | SRE |
| Cost per tenant/feature | Daily chart + anomaly alert | FinOps + ML |
| Faithfulness 3% sample | 7d baseline stored | ML |
| Guardrail fields joined on trace | From Guide 4 checklist | Safety |
| Incident runbook linked in alerts | On-call drill completed | SRE |
| PII redaction verified | Sec review sign-off | Security |
Pre-CI handoff fields
Export these to your eval CI pipeline (next guide):
- prompt_version, model_resolved
- faithfulness.sample weekly avg
- cost_usd per request for budget regression tests
- Exemplar trace_id links on failed judge scores
```python
# run_pytest, env_float, grafana, and pagerduty are assumed helpers/clients.
CHECKS = {
    "schema_tests_pass": run_pytest("tests/test_llm_log_schema.py"),
    "prod_trace_sample_rate": env_float("PROD_TRACE_SAMPLE") <= 0.1,
    "dashboard_exists": grafana.api.dashboard_uid("llm-ttft") is not None,
    "alert_linked": pagerduty.has_runbook("llm_429_spike"),
}


def observability_ready() -> bool:
    failed = [k for k, v in CHECKS.items() if not v]
    if failed:
        print("Blocked:", failed)
        return False
    return True
```
```java
import java.util.Map;
import java.util.function.Supplier;

Map<String, Supplier<Boolean>> checks = Map.of(
    "schema_tests_pass", () -> pytest.run("tests/LlmLogSchemaTest"),
    "prod_trace_sample_rate", () -> prodTraceSample <= 0.1f
);

boolean observabilityReady() {
    return checks.entrySet().stream().allMatch(e -> e.getValue().get());
}
```
```python
# Weekly review, 30 min, ML + SRE + product: outputs tickets, not slides
AGENDA = [
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend",
    "New error classes this week",
    "Traces promoted to golden set",
]
```
```java
List<String> AGENDA = List.of(
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend"
);
```
You can explain LLM-specific observability gaps, log a canonical schema, pick tracing tooling, define SLIs, and alert with runbooks. Guide 6 adds CI evaluation—regression gates that block bad prompt merges before they reach the traffic you just instrumented.
Checklist: verify prod logs contain zero raw PAN/SSN samples in weekly automated scan; alert if regex hits increase.
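A rough sketch of such a scan. The regexes are deliberately simple assumptions (US-style SSN, 13 to 19 digit card numbers with optional spaces or dashes), and the Luhn check is added here to cut false positives from order IDs and timestamps; a production scanner would need broader patterns.

```python
import re

PAN_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_line(line: str) -> list[str]:
    """Return the kinds of sensitive data found in one log line."""
    hits = []
    for m in PAN_RE.finditer(line):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append("pan")
    if SSN_RE.search(line):
        hits.append("ssn")
    return hits
```

Run it over a sample of the week's logs and emit the hit count as a metric; the alert then keys on that count rising above zero.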
Teams that complete this checklist before CI eval report 40% faster root-cause on prompt regressions—trace exemplars become golden rows automatically.
"Observability stack for LLMs?" — Schema per call, trace waterfall, TTFT/cost SLIs, sampled judges, anomaly alerts, closed loop to eval datasets.
Budget 1–3% of inference spend for observability SaaS + storage—cheaper than one undetected runaway agent weekend.
Trace retention 7–14d hot, 90d cold object storage—long enough for weekly quality review, not a compliance lake by accident.
100% tracing gives perfect forensics but doubles cost at scale—10% head-based sampling plus 100% on errors is the usual compromise.
Shipping dashboards without alert runbooks is a common failure: on-call acknowledges pages and then closes them, and drift continues until CSAT collapses.
Track 5 guide sequence
| Guide | Topic | You are here |
|---|---|---|
| Guide 4 | Safety & guardrails | ← prev |
| Guide 5 | Observability & monitoring | ✓ |
| Guide 6 | Continuous evaluation in CI | next → |
FAQ: Sample rate for traces?
100% staging; prod 5–10% head sample plus 100% on error and guardrail block. Increase temporarily during launches.
FAQ: Who owns quality alerts?
ML oncall triages; product approves prompt rollback; platform owns latency/error pages.