Observability & monitoring

Observability explained: why offline evals are not enough

Offline evals answer “did we pass the golden set we wrote?” Online observability answers “is production still healthy on traffic we never imagined?” This section contrasts offline vs online evaluation, maps the three pillars for LLM apps—traces, metrics, logs—and closes the eval loop from production failures back to gold.

Your refund copilot passes 200 golden cases in CI on Tuesday. Wednesday morning support volume doubles and answer quality craters, with the same deploy, the same prompt hash, and the same model string in config. Traditional APM shows green p99 latency and zero 5xx errors because every request returned HTTP 200 with a plausible paragraph. LLM observability exists because correctness is probabilistic, spend is per-token, and failure modes hide inside successful responses. Offline evals are necessary; they are not sufficient.

Offline vs online evaluation

Dimension	Offline eval (CI / nightly)	Online observability (prod)
Traffic	Frozen golden JSONL	Live user queries, seasonal spikes, attacks
Latency	Batch, unbounded wall time	TTFT p95, queue depth, user-facing SLOs
Cost	Budgeted job; capped N cases	Per-request token spend; runaway agent loops
Correctness	Pass rate vs rubric	Sampled judges, thumbs, escalations, drift
Feedback loop	Human updates gold before merge	Reviewed prod failures → gold within days
Guardrails	Adversarial gold in CI	Block-rate dashboards, injection spikes

Three pillars for LLM apps

Standard observability stacks still apply, but LLM apps need pillar-specific fields. Traces lay out retrieval, tool calls, guardrail stages, and generation as a span waterfall under one trace_id. Metrics aggregate refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request. Logs emit structured JSON per request (next section) for audit, billing reconciliation, and incident replay without storing raw PII.

Pillar	LLM-specific signal	Example tool
Traces	Span per retrieve / tool / generate / guardrail	LangSmith, Langfuse, OTel
Metrics	TTFT p95, tokens_in/out, cost_usd, block_rate	Prometheus, Datadog, Grafana
Logs	prompt_version, retrieval_ids, guardrail_decisions	JSON → Loki / CloudWatch / ELK

flowchart LR
  subgraph offline["Offline eval"]
    G[Golden JSONL] --> CI[CI / nightly suite]
    CI -->|pass| Ship[Deploy]
  end
  subgraph online["Online observability"]
    Prod[Production traffic] --> Log[Structured logs + traces]
    Log --> Dash[Dashboards + alerts]
    Dash --> Sample[Sample 1-5% shadow scoring]
  end
  Sample -->|failure| Review[Human review queue]
  Review --> G
  Dash -->|drift alert| Rollback[Prompt rollback / feature flag]

Closing the eval loop: production → gold

Mature teams treat production surprises as dataset candidates, not one-off tickets. A faithfulness alert fires → on-call opens trace exemplars → PM confirms bad answer → row added to gold/refund_policy_v3.jsonl within 48h → CI blocks the next regression. Observability without this loop produces pretty dashboards that never change prompts.
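The promotion step above can be sketched as a small helper that appends a reviewed failure to a gold JSONL file. This is a minimal sketch, not a prescribed pipeline: the row fields (source_request_id, bad_output, expected, reviewer) and the shape of the trace dict are assumptions for illustration, and the input is assumed to be redacted upstream.

```python
import json
from pathlib import Path


def promote_to_gold(gold_path: Path, trace: dict, expected: str, reviewer: str) -> bool:
    """Append a human-reviewed production failure to a gold JSONL file.

    Returns False and writes nothing if this request_id was already promoted.
    """
    existing = set()
    if gold_path.exists():
        with gold_path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["source_request_id"] for line in f if line.strip()}
    if trace["request_id"] in existing:
        return False

    row = {
        "source_request_id": trace["request_id"],
        "input": trace["query"],          # assumed already redacted upstream
        "bad_output": trace["answer"],    # kept as a negative example
        "expected": expected,             # written by the reviewer
        "prompt_version": trace["prompt_version"],
        "reviewer": reviewer,
    }
    with gold_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return True
```

Deduplicating on the source request keeps a noisy incident from flooding the gold set with near-identical rows.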

Non-deterministic outputs

Identical messages[], temperature 0, and model ID can still diverge across provider patches, batching order, and floating-point nondeterminism on some GPUs. You cannot assert response == expected in production—you score distributions, sample traces, and compare embedding centroids week over week. Offline evals give point estimates; online observability tracks drift of those estimates on live traffic slices.

Same input, different answer: a user retries after a vague reply and gets contradictory policy guidance.
Silent model swap: the provider re-points the gpt-4o alias from gpt-4o-2024-08-06 to a newer snapshot while your config stays unchanged.
Tool nondeterminism: the agent calls search twice; the second result set changes the final refund decision.
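One way to put a number on "same input, different answer" is to group logged calls by prompt hash and measure how often the response hash differs from the most common one. The sketch below assumes each log row carries messages_hash and response_hash fields; the function name is illustrative.

```python
from collections import defaultdict


def divergence_rate(calls: list[dict]) -> dict[str, float]:
    """For each prompt hash seen at least twice, return the share of calls
    whose response hash differs from the most common response."""
    by_prompt: dict[str, list[str]] = defaultdict(list)
    for call in calls:
        by_prompt[call["messages_hash"]].append(call["response_hash"])

    rates = {}
    for prompt_hash, responses in by_prompt.items():
        if len(responses) < 2:
            continue  # a single call has nothing to diverge from
        most_common = max(set(responses), key=responses.count)
        rates[prompt_hash] = 1 - responses.count(most_common) / len(responses)
    return rates
```

A rising rate on an unchanged prompt_version is a hint to look at provider routing or tool results rather than your own deploys.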

Latency variance (TTFT vs total)

LLM latency has two phases: time-to-first-token (TTFT) reflects queue depth and prompt size; total latency adds decode time, which grows with output length. A p99 total latency alert can fire during a marketing push not because GPUs failed but because users ask longer questions and models emit longer answers. Split SLIs into ttft_ms, decode_ms, and tokens_out; otherwise you chase the wrong knob (scaling replicas when you need max_tokens caps).

Signal	Traditional API	LLM call	What to chart
Success	HTTP 2xx	HTTP 2xx + empty/refusal/hallucination	Semantic pass rate sample
Latency	Single wall time	TTFT + decode + tool round-trips	TTFT p50/p95, total p95
Payload	Fixed JSON schema	Variable tokens in/out	Tokens per feature, context fill ratio
Errors	Exceptions, 5xx	Rate limits, context overflow, policy block	Typed error codes per vendor
Correctness	Deterministic tests	Judge scores, user thumbs, task success	Weekly drift vs golden centroid

Cost unpredictability

A “simple” FAQ bot becomes expensive when retrieval returns 40 chunks, the model cites each, and a follow-up rewrites the whole thread into context. Cost scales with tokens in + tokens out + tool calls + judge evals, not QPS alone. Finance sees a 3× bill spike while traffic is flat—you need per-feature, per-tenant, per-model attribution in logs, not an aggregate CloudWatch billing alarm two days late.

Cost drivers to attribute per request

Driver	Log field	Typical surprise
Prompt bloat	tokens_in	System prompt duplicated per turn
Runaway decode	tokens_out, finish_reason	Missing max_tokens on tool loop
Retry storm	attempt	429 backoff not capped
Judge tax	eval_judge_tokens	100% online LLM-as-judge in prod
Embedding refresh	embedding_tokens	Re-embed on every query
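Per-feature, per-tenant, per-model attribution amounts to a group-by over the structured log rows. The sketch below assumes rows carry tenant_id, feature, model, cost_usd, tokens_in, and tokens_out; in practice this would likely run as a query in your log store, so treat the in-memory version as illustration only.

```python
from collections import defaultdict


def cost_by_slice(rows: list[dict], keys=("tenant_id", "feature", "model")) -> dict[tuple, dict]:
    """Sum requests, cost, and tokens per (tenant, feature, model) slice."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "tokens_in": 0, "tokens_out": 0})
    for row in rows:
        slice_key = tuple(row.get(k, "unknown") for k in keys)
        t = totals[slice_key]
        t["requests"] += 1
        t["cost_usd"] += row["cost_usd"]
        t["tokens_in"] += row["tokens_in"]
        t["tokens_out"] += row["tokens_out"]
    return dict(totals)


def top_spenders(rows: list[dict], n: int = 3) -> list[tuple[tuple, dict]]:
    """Return the n most expensive slices, highest spend first."""
    totals = cost_by_slice(rows)
    return sorted(totals.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True)[:n]
```

With flat traffic and a rising bill, the top slice usually points at one tenant or one feature, which narrows the search to a single row of the table above.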

Quality drift without deploys

Index rot, competitor docs changing, seasonal intents, and adversarial prompt patterns shift quality while git SHA is frozen. RAG retrieval recall drops when a PDF update splits tables across chunks—answers sound confident and wrong. Observability connects retrieval metrics (hit rate, MRR@k), generation metrics (faithfulness sample), and product metrics (escalation rate, CSAT) on shared request_id so you know which layer drifted.
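The retrieval metrics named above have standard definitions that fit in a few lines. This sketch assumes you have, per labeled query, a ranked list of retrieved chunk IDs and a set of chunk IDs judged relevant; where those labels come from is up to your eval set.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries with at least one relevant chunk in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)


def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant chunk within the top k
    (a query with no relevant chunk in the top k contributes 0)."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked[:k], start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```

Tracking both on the same labeled queries week over week separates "retrieval stopped finding the chunk" from "generation misused a chunk it was given".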

import hashlib
import logging
from dataclasses import dataclass, asdict

logger = logging.getLogger("llm")

@dataclass
class LlmCallMeta:
    request_id: str
    model: str
    provider: str
    prompt_version: str
    temperature: float
    seed: int | None
    tokens_in: int
    tokens_out: int
    finish_reason: str
    response_hash: str  # first 16 hex chars of sha256 over normalized text

def normalize(text: str) -> str:
    return " ".join(text.split()).lower()

def record_completion(req_id: str, model: str, text: str, usage, **kwargs) -> LlmCallMeta:
    meta = LlmCallMeta(
        request_id=req_id,
        model=model,
        provider=kwargs.get("provider", "openai"),
        prompt_version=kwargs["prompt_version"],
        temperature=kwargs.get("temperature", 0.0),
        seed=kwargs.get("seed"),
        tokens_in=usage.prompt_tokens,
        tokens_out=usage.completion_tokens,
        finish_reason=kwargs.get("finish_reason", "stop"),
        response_hash=hashlib.sha256(normalize(text).encode()).hexdigest()[:16],
    )
    logger.info("llm_completion", extra=asdict(meta))
    return meta

public record LlmCallMeta(
    String requestId, String model, String provider, String promptVersion,
    double temperature, Integer seed, int tokensIn, int tokensOut,
    String finishReason, String responseHash) {}

public LlmCallMeta recordCompletion(String reqId, String model, String text, Usage usage, Map kwargs) {
  String hash = sha256(normalize(text)).substring(0, 16);
  var meta = new LlmCallMeta(reqId, model, (String) kwargs.getOrDefault("provider", "openai"),
      (String) kwargs.get("prompt_version"), (double) kwargs.getOrDefault("temperature", 0.0),
      (Integer) kwargs.get("seed"), usage.promptTokens(), usage.completionTokens(),
      (String) kwargs.getOrDefault("finish_reason", "stop"), hash);
  structuredLog.info("llm_completion", meta);
  return meta;
}

import time
from contextlib import contextmanager

@contextmanager
def llm_stream_timers(request_id: str):
    # metrics is your StatsD/Datadog-style client. Keep request_id in logs and
    # traces; as a metric tag it creates unbounded cardinality.
    marks = {"start": time.perf_counter(), "first_token": None}
    try:
        yield marks
    finally:
        # Emit timings even if the stream raises mid-response
        total_ms = (time.perf_counter() - marks["start"]) * 1000
        first = marks["first_token"]
        ttft_ms = (first - marks["start"]) * 1000 if first is not None else total_ms
        metrics.timing("llm.ttft_ms", ttft_ms)
        metrics.timing("llm.decode_ms", total_ms - ttft_ms)
        metrics.timing("llm.total_ms", total_ms)

# In streaming handler:
# marks["first_token"] = time.perf_counter()  # on first SSE chunk

public class LlmStreamTimers implements AutoCloseable {
  private final long start = System.nanoTime();
  private Long firstTokenNs = null;

  public void onFirstToken() {
    if (firstTokenNs == null) firstTokenNs = System.nanoTime();
  }

  @Override
  public void close() {
    long totalMs = (System.nanoTime() - start) / 1_000_000;
    long ttftMs = firstTokenNs == null ? totalMs : (firstTokenNs - start) / 1_000_000;
    metrics.timing("llm.ttft_ms", ttftMs);
    metrics.timing("llm.decode_ms", totalMs - ttftMs);
    metrics.timing("llm.total_ms", totalMs);
  }
}

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class DriftMonitor:
    def __init__(self, baseline_vectors: np.ndarray, threshold: float = 0.92):
        self.centroid = baseline_vectors.mean(axis=0)
        self.threshold = threshold

    def score_batch(self, live_vectors: np.ndarray) -> dict:
        live_centroid = live_vectors.mean(axis=0)
        sim = float(cosine_similarity([self.centroid], [live_centroid])[0, 0])
        return {"centroid_similarity": sim, "drift_alert": sim < self.threshold}

# Run nightly on sampled answer embeddings; page if drift_alert.

public record DriftScore(double centroidSimilarity, boolean driftAlert) {}

public DriftScore scoreBatch(double[][] liveVectors, double[] baselineCentroid, double threshold) {
  double[] liveCentroid = meanRows(liveVectors);
  double sim = cosineSimilarity(baselineCentroid, liveCentroid);
  return new DriftScore(sim, sim < threshold);
}

⚠️ Pitfall

Alerting on exact string match for regressions is flaky even at temperature 0 and meaningless above it. Use semantic similarity bands, judge scores, or task-success labels instead.

🔬 Under the Hood

Provider TTFT includes queue time in shared multi-tenant clusters; your p95 TTFT spike may be vendor-side congestion, not your prompt doubling. Always log provider_request_id for support tickets.

🎯 Interview Tip

"Why is LLM observability hard?" — Non-deterministic correctness, token-based cost, TTFT+decode latency split, silent model/index drift, failures masked as 200 OK. Mention trace spans per tool call and sampling for judges.

FAQ: Do we still need Datadog?

Yes—LLM traces complement APM. Keep JVM/Python service metrics in Datadog; export LLM spans to Langfuse or OpenTelemetry with gen_ai.* semantic conventions. Correlate via trace_id.

FAQ: Is temperature 0 enough for stability?

It reduces variance; it does not guarantee bitwise-identical outputs or protect against retrieval/index drift. Still log hashes and run online samples.

Logging schema: structured JSON per request

Structured logging is the foundation of LLM observability. Each completion emits one JSON row with trace_id, request_id, model, prompt_version, latency_ms, token counts, retrieval_ids, tool_calls, guardrail_decisions, and cost_usd—never raw PII.

Every LLM call—chat, embed, rerank, judge, moderation—should emit one structured log row and one trace span with the same request_id. If you cannot replay what the model saw, you cannot debug hallucinations, cost spikes, or guardrail false positives. Treat logs as a contract between app eng, ML, and platform: schema in git, enforced in code review, validated in staging.

Canonical fields per LLM call

Field	Required?	Example	Why
request_id	Yes	req_8f2a	Join app, RAG, guardrails, support ticket
trace_id	Yes	OpenTelemetry hex	Cross-service waterfall
messages or hash	Yes	Redacted JSON or prompt_sha256	Replay without storing raw PII
model	Yes	gpt-4o-2024-08-06	Detect silent provider swaps
tokens_in/out	Yes	1842 / 412	Cost + context pressure
latency_ms	Yes	ttft: 420, total: 3100	SLO dashboards
cost_usd	Yes	0.0184	Per-feature chargeback
retrieval_ids	RAG yes	["chk_91","chk_44"]	Debug wrong citations
tool_calls	Agents yes	[{name, args_hash, latency_ms}]	Loop detection
guardrail_decisions	Yes	[{stage, action, dimension}]	Join Guide 4 block rates
metadata	Yes	tenant, feature, prompt_version	Slice drift alerts
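The "loop detection" use of tool_calls can be as simple as counting repeated (name, args_hash) pairs within one request. The sketch below assumes each tool call entry has name and args_hash as in the table; the max_repeats default is an arbitrary starting point, not a recommendation.

```python
from collections import Counter


def detect_tool_loop(tool_calls: list[dict], max_repeats: int = 3) -> list[tuple[str, str, int]]:
    """Return (name, args_hash, count) for every tool call that was repeated
    with identical arguments more than max_repeats times in one request."""
    counts = Counter((call["name"], call["args_hash"]) for call in tool_calls)
    return [(name, args_hash, n) for (name, args_hash), n in counts.items() if n > max_repeats]
```

Hashing arguments instead of logging them keeps the check usable in production, where raw tool arguments may contain customer data.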

Messages: store vs hash

Production rarely logs full prompts with customer PII. Pattern: log messages_redacted (Presidio scrubbed) in prod; log full payload in staging; always log prompt_version, system_prompt_hash, and context_chunk_ids[] so you can reconstruct from versioned stores when investigating incidents.

Redaction levels

Environment	User content	Assistant content	Retrieval
Production	Mask PII; truncate >2k chars	Store hash + first 200 chars	chunk_id + score only
Staging	Full with synthetic data only	Full	Full chunk text
Debug (sampled)	Encrypted blob, 0.1% sample	Same	Same
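The production row of the table can be sketched with the standard library. The regexes below are a deliberately crude stand-in for a real PII scrubber such as Presidio and will miss many PII forms; function names and placeholder tokens are illustrative.

```python
import hashlib
import re

# Crude stand-ins for a real PII detector; assume these miss names, addresses, etc.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){8,}\b")  # 9+ digits: phone, card, account numbers


def redact_user(text: str, limit: int = 2000) -> str:
    """Mask obvious PII, then truncate to the production limit."""
    masked = EMAIL.sub("<EMAIL>", text)
    masked = LONG_DIGITS.sub("<NUMBER>", masked)
    return masked[:limit]


def redact_assistant(text: str, preview_chars: int = 200) -> dict:
    """Store a hash of the full answer plus a short, masked preview."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "preview": redact_user(text)[:preview_chars],
    }


def redact_retrieval(chunks: list[dict]) -> list[dict]:
    """Keep chunk_id and score only; chunk text stays in the versioned index."""
    return [{"chunk_id": c["chunk_id"], "score": c["score"]} for c in chunks]
```

Masking before truncating matters: truncating first can cut an email or card number in half so that the pattern no longer matches.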

Model and provider metadata

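Log both model_requested and model_resolved after gateway routing. Bedrock maps friendly names to model IDs and ARNs, and Azure OpenAI routes by deployment name, so the string in your config is not always the model that answered. Include provider_request_id for vendor support escalations during outage windows.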

Tokens, latency, and cost

Compute cost_usd in-app from a versioned price table instead of waiting on billing exports, which arrive too late for alerting. Break out cached_tokens when using prompt caching and price them at the cached rate, so marginal-cost dashboards do not overstate spend.

import logging
from typing import Any

from pydantic import BaseModel, Field

logger = logging.getLogger("llm")

class LlmCallLog(BaseModel):
    request_id: str
    trace_id: str
    feature: str
    tenant_id: str
    prompt_version: str
    model_requested: str
    model_resolved: str
    provider: str
    tokens_in: int
    tokens_out: int
    cached_tokens: int = 0
    cost_usd: float
    latency_ms: dict[str, float]  # ttft, total, tools
    finish_reason: str
    messages_hash: str
    retrieval_ids: list[str] = Field(default_factory=list)
    tool_calls: list[dict[str, Any]] = Field(default_factory=list)
    guardrail_decisions: list[dict[str, str]] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)

def emit_log(log: LlmCallLog) -> None:
    # Never log full PII — messages_hash + redacted preview only
    logger.info("llm_call", extra=log.model_dump())

public record LlmCallLog(
    String requestId, String traceId, String feature, String tenantId,
    String promptVersion, String modelRequested, String modelResolved,
    String provider, int tokensIn, int tokensOut, int cachedTokens,
    double costUsd, Map latencyMs, String finishReason,
    String messagesHash, Map metadata) {}

public void emitLog(LlmCallLog log) {
  // Never log full PII — messagesHash + redacted preview only
  structuredLog.info("llm_call", log);
}

from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")

def span_llm_call(model: str, messages_hash: str):
    return tracer.start_as_current_span(
        "chat.completions",
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.prompt.hash": messages_hash,
            "feature": "refund_copilot",
        },
    )

# After response:
# span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
# span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)

public Span spanLlmCall(String model, String messagesHash) {
  return tracer.spanBuilder("chat.completions")
      .setAttribute("gen_ai.system", "openai")
      .setAttribute("gen_ai.request.model", model)
      .setAttribute("gen_ai.prompt.hash", messagesHash)
      .setAttribute("feature", "refund_copilot")
      .startSpan();
}

PRICES = {
    "gpt-4o-2024-08-06": {"in": 2.50, "out": 10.00},  # USD per 1M tokens
    "text-embedding-3-small": {"in": 0.02, "out": 0.0},
}

def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

def enrich_with_cost(log: dict) -> dict:
    log["cost_usd"] = round(cost_usd(log["model"], log["tokens_in"], log["tokens_out"]), 6)
    return log

public double costUsd(String model, int tokensIn, int tokensOut) {
  var p = priceTable.get(model);
  return (tokensIn * p.inputPerMillion() + tokensOut * p.outputPerMillion()) / 1_000_000.0;
}
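The cost functions above ignore cached_tokens. A variant that prices cached input separately might look like the sketch below. It assumes cached_tokens is a subset of tokens_in, and the model name, rates, and version string are placeholders rather than vendor quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float         # USD per 1M uncached input tokens
    cached_input_per_m: float  # USD per 1M cached input tokens
    output_per_m: float        # USD per 1M output tokens


# Hypothetical versioned price table; the rates are placeholders.
PRICE_TABLE_VERSION = "example-v1"
PRICE_TABLE = {"example-model": Price(2.00, 1.00, 8.00)}


def cost_usd_with_cache(model: str, tokens_in: int, tokens_out: int, cached_tokens: int = 0) -> float:
    """Price cached input tokens at the cached rate and the rest at the full rate."""
    if cached_tokens > tokens_in:
        raise ValueError("cached_tokens is assumed to be a subset of tokens_in")
    price = PRICE_TABLE.get(model)
    if price is None:
        raise KeyError(f"no price for {model!r} in table {PRICE_TABLE_VERSION}")
    uncached = tokens_in - cached_tokens
    total = (
        uncached * price.input_per_m
        + cached_tokens * price.cached_input_per_m
        + tokens_out * price.output_per_m
    )
    return total / 1_000_000
```

Raising on an unknown model is a deliberate choice here: a silent zero would hide spend on exactly the models nobody has priced yet.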

💡 Tip

Standardize on one metadata.feature enum—refund_copilot, ticket_summarizer—before you have 40 microservices logging service_name inconsistently.

🔒 Security

Never log API keys, raw OAuth tokens, or full PAN/SSN—even in ‘temporary’ debug flags. Use hash + encrypted object storage with TTL for sampled full prompts.

⚖️ Trade-off

Full message logging aids debug but increases GDPR surface and storage cost. Hash + versioned prompt registry + chunk IDs is enough to investigate most incidents at a fraction of the risk.

Metadata per call: minimum viable tags

tenant_id — multi-tenant cost and quality isolation
prompt_version — git tag or registry semver
retrieval_index_version — vector store snapshot
guardrail_actions[] — from Guide 4 traces
user_locale — slice drift detection

FAQ: Log embeddings?

Log embedding_model, tokens_in, and query hash—not the 1536-d vector unless debugging index issues in staging.

LangSmith & Langfuse: traces, feedback, datasets

LangSmith (LangChain-native) and Langfuse (open-source, self-hostable) are the two most common LLM trace hubs. Both capture runs, attach feedback, link datasets to failures, and support prompt versioning—pick based on LangChain coupling and data residency.

Traces without feedback are archaeology. Production observability needs a place to store run trees, human thumbs, judge scores, and the golden rows those failures become. LangSmith and Langfuse both close that loop; they differ on deployment model, LangChain depth, and self-host flexibility.

LangSmith vs Langfuse

Dimension	LangSmith	Langfuse
Deployment	Cloud (LangChain)	Cloud or self-host (Docker/K8s)
LangChain integration	Automatic run trees	SDK decorators; works without LC
Runs & spans	Runs, child steps, playground replay	Traces, generations, sessions
Feedback	Human + API scores on runs	Scores linked to traces
Datasets	First-class; trace → example	Datasets + prompt registry
Best for	LangGraph monoliths, fast iteration	Regulated industries, multi-stack
Watch out	Vendor lock-in; trace volume cost	Self-host ops (Postgres + ClickHouse)

LangSmith

Deepest integration with LangChain: automatic run trees, playground from traces, datasets linked to failures. Strong for teams already on LangGraph. Cost scales with trace volume; use sampling in high-QPS chat.

Langfuse

Open-source trace UI with sessions, generations, scores, and prompt versions. Self-host for regulated industries; cloud for speed. Python/JS SDKs wrap manual spans without LangChain.

When to pick which

Situation	Pick	Why
LangChain / LangGraph monolith	LangSmith	Zero-config run trees + dataset export
Must self-host traces (HIPAA, EU)	Langfuse OSS	Data stays in your VPC
Polyglot stack (Java + Python)	Langfuse SDK	Framework-agnostic decorators
Need traces + CI eval in one UI	LangSmith	Datasets linked to failed runs
Cost/latency only, week-one bootstrap	Helicone (next section)	Proxy, not full trace hub

import os
import uuid

from langsmith import Client
from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "refund-copilot"

client = Client()
llm = ChatOpenAI(model="gpt-4o-mini")

# Runs auto-export when LANGCHAIN_TRACING_V2=true.
# Pre-assign the run ID so feedback attaches to this exact run;
# the returned message's response_metadata does not carry a run_id.
run_id = uuid.uuid4()
result = llm.invoke("What is the refund window?", config={"run_id": run_id})

# Attach human feedback to the run
client.create_feedback(
    run_id=run_id,
    key="helpfulness",
    score=0.85,
    comment="Correct policy, missing link",
)

// LangSmith REST — manual run log from non-LangChain Java service
HttpRequest req = HttpRequest.newBuilder()
    .uri(URI.create("https://api.smith.langchain.com/runs"))
    .header("x-api-key", langSmithKey)
    .header("Content-Type", "application/json")  // runJson is a JSON body
    .POST(HttpRequest.BodyPublishers.ofString(runJson))
    .build();
// POST feedback: /feedback with run_id + score

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context  # decorator API of the v2 Python SDK

langfuse = Langfuse()

@observe()
def answer_refund_question(request_id: str, query: str) -> str:
    langfuse_context.update_current_trace(metadata={"request_id": request_id})
    chunks = retrieve(query)  # nest @observe on retrieve too
    response = llm.complete(build_messages(chunks, query))
    langfuse_context.score_current_trace(name="helpfulness", value=0.9)
    return response

// Langfuse Java SDK — manual span
LangfuseClient client = LangfuseClient.create(publicKey, secretKey);
client.trace(new TraceBody("refund-" + requestId)
    .metadata(Map.of("feature", "refund_copilot")));
// Log generation with model, usage, output

📦 Real World

Support platform team exported LangSmith failed runs to JSONL nightly—same rows became CI gold in Guide 6 within one sprint.

💰 Cost

LangSmith cost tracks trace volume, so sample 5–10% of prod chat and trace 100% of staging. Self-hosting Langfuse trades license fees for roughly $800–2k/mo of infrastructure.

⚙️ Config

Set LANGCHAIN_PROJECT per feature (refund, billing, internal)—prevents trace UI noise and simplifies dataset export.

FAQ: LangSmith without LangChain?

Possible via REST/SDK manual runs, but awkward—pick Langfuse if not on LangChain.

Helicone: cost and latency proxy

Helicone sits as an OpenAI-compatible reverse proxy—swap base_url, get request logging, cost attribution, latency histograms, and caching headers in minutes. Complement (not replace) LangSmith/Langfuse when you need fast cost telemetry without full RAG span instrumentation.

Helicone sees every proxied completion: model, tokens, latency, status, and custom properties via headers. It cannot see retrieval or guardrail logic that runs before the proxy call—use it for cost/latency SLIs and budget alerts while SDK tracing captures the full waterfall.

Helicone capabilities

Feature	How	Use case
Request logging	Proxy all OpenAI traffic	Instant cost dashboard
Custom properties	Helicone-Property-* headers	Slice by tenant, feature, prompt_version
Response caching	Helicone-Cache-Enabled	Repeat FAQ queries; watch stale policy risk
Budget alerts	Helicone dashboard + webhooks	Page when daily spend > forecast
Latency tracking	Per-request TTFT + total	Provider incident detection

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Property-Feature": "refund_copilot",
        "Helicone-Property-Tenant": tenant_id,
        "Helicone-Property-Prompt-Version": prompt_version,
    },
)

OpenAIClient client = OpenAIOkHttpClient.builder()
    .baseUrl("https://oai.helicone.ai/v1")
    .addHeader("Helicone-Auth", "Bearer " + heliconeKey)
    .addHeader("Helicone-Property-Feature", "refund_copilot")
    .addHeader("Helicone-Property-Tenant", tenantId)
    .addHeader("Helicone-Property-Prompt-Version", promptVersion)
    .apiKey(openAiKey)
    .build();

Caching headers

Enable Helicone cache for idempotent FAQ lookups, and make sure prompt_version is part of the cache key (the Helicone-Cache-Seed header namespaces cache entries). Invalidate cache on policy deploy, not just model deploy.

headers = {
    "Helicone-Auth": f"Bearer {helicone_key}",
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Seed": "refund_v12",       # namespace the cache per prompt version
    "Helicone-Cache-Bucket-Max-Size": "1",     # responses kept per cache key, not total cache capacity
    "Helicone-Property-Prompt-Version": "refund_v12",
}
# Only cache temperature=0, no tool calls, no user-specific context

.addHeader("Helicone-Cache-Enabled", "true")
.addHeader("Helicone-Cache-Seed", "refund_v12")
.addHeader("Helicone-Cache-Bucket-Max-Size", "1")
.addHeader("Helicone-Property-Prompt-Version", "refund_v12")

Budget alerts

Configure Helicone webhooks for daily spend thresholds; forward to PagerDuty for hard caps and Slack for soft warnings. Mirror the same thresholds in your internal cost_usd logs so Helicone and app metrics agree.

⚖️ Trade-off

Helicone proxy is fastest to deploy but blind to pre-proxy retrieval and guardrail spans. Use alongside Langfuse/LangSmith for full waterfall.

⚠️ Pitfall

Caching policy answers without prompt_version in cache key—users get stale refund rules after deploy.

💰 Cost

Helicone free tier covers early prod; paid tiers scale with logged requests. ROI is catching one runaway agent loop before finance notices.

Dashboards & alerts: SLIs and thresholds

Production dashboards track five core SLIs: refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request—plus guardrail block rate from Guide 4. Wire alert thresholds in Grafana or Datadog with runbooks, not Slack noise.

SLIs for LLM apps mirror microservices (latency, errors, saturation) and add two LLM-specific signals: quality and cost. Platform SRE owns TTFT and error budgets; product ML owns sampled judge scores and hallucination rates. Define SLOs per feature, not globally: a ticket summarizer can tolerate 8s p95 while checkout copilot targets 2s TTFT p95.

Core SLIs at a glance

SLI	Definition	Typical threshold	Source
Refusal rate	Guardrail + policy blocks / total requests	2–8% baseline; alert if 2× spike	guardrail_decisions
Faithfulness proxy	Sampled NLI/judge pass rate	≥88% weekly; alert if −8pts	Online eval sample
P95 latency	TTFT + total wall time	TTFT <800ms, total <4s	Helicone / app metrics
Error rate	429, 5xx, overflow, empty	<1% 5xx; <5% 429	Typed error_class
Cost / request	cost_usd p50/p99	Alert if p99 3× baseline 1h	Structured logs
Guardrail block rate	Blocks by dimension (Guide 4)	Injection spike → SecOps	dimensions_flagged

TTFT p95 and total latency

SLI	Definition	Typical SLO (chat)	Alert
TTFT p95	First token − request start	< 800 ms	Page if >1.2s 15m
Total p95	Last token − request start	< 4 s	Ticket if >6s 1h
Stream stall	Gap between chunks >2s	< 0.5% requests	Warn
Tool round-trip	Per tool span	< 1.5s p95	Per dependency
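The stream stall SLI in the table can be computed from per-chunk arrival timestamps. The sketch below assumes your streaming handler records a monotonic timestamp (in seconds) for each chunk; the function names are illustrative.

```python
def stream_stalls(chunk_times_s: list[float], max_gap_s: float = 2.0) -> list[float]:
    """Return every gap between consecutive chunks that exceeds max_gap_s."""
    return [
        later - earlier
        for earlier, later in zip(chunk_times_s, chunk_times_s[1:])
        if later - earlier > max_gap_s
    ]


def stall_rate(requests: list[list[float]], max_gap_s: float = 2.0) -> float:
    """Share of requests that had at least one stall."""
    if not requests:
        return 0.0
    stalled = sum(1 for times in requests if stream_stalls(times, max_gap_s))
    return stalled / len(requests)
```

Counting stalled requests rather than stalled gaps matches the table's SLO, which is expressed as a percentage of requests.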

Error rate (typed)

Split 429, context_length_exceeded, guardrail blocks, empty completions, and JSON parse failures—lumping into “5xx rate” hides fixable product bugs.

Error class	Owner	Mitigation
Rate limit 429	Platform	Token bucket, queue, model fallback
Context overflow	App eng	Summarize thread, trim retrieval
Guardrail block	Safety	Expected; track false positive sample
Invalid JSON tool args	App eng	Repair prompt, tighten schema
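A typed error_class can be assigned at the call site before the log row is emitted. The sketch below is one possible mapping, assuming the caller passes the HTTP status, the provider's error code string, whether your own guardrail blocked the request, and the response text; the class names are illustrative and should match whatever enum your dashboards use.

```python
import json
from typing import Optional


def classify_error(
    status: Optional[int],
    text: Optional[str],
    provider_code: Optional[str] = None,
    guardrail_blocked: bool = False,
    expects_json: bool = False,
) -> Optional[str]:
    """Map one LLM call outcome to a typed error class, or None if healthy."""
    if status == 429:
        return "rate_limit"
    if provider_code == "context_length_exceeded":
        return "context_overflow"
    if status is not None and status >= 500:
        return "provider_5xx"
    if guardrail_blocked:
        return "guardrail_block"
    if not (text or "").strip():
        return "empty_completion"  # HTTP 200 with nothing useful in it
    if expects_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "invalid_json"
    return None
```

The order of checks is a design choice: transport-level failures are classified first so that an empty body caused by a 429 is not miscounted as an empty completion.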

Token and cost per feature

Dashboard: sum(cost_usd) by feature / DAU, plus tokens_in p95 as early warning for prompt bloat. Anomaly detection on daily spend per tenant catches infinite agent loops before finance does.

Hallucination sampling

You cannot run NLI faithfulness on 100% of prod traffic. Sample 1–5% stratified by locale and intent; run lightweight judge (gpt-4o-mini with retrieved-only context); store scores in trace UI. Trend weekly; page on 10pt drop vs rolling baseline.
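Stratified sampling can be made deterministic by hashing the request ID, so a retry of the same request gets the same decision and any service can recompute it. In the sketch below, the per-stratum rates and the (locale, intent) keys are hypothetical examples inside the 1–5% range discussed above.

```python
import hashlib

DEFAULT_RATE = 0.03
# Hypothetical overrides: oversample small or risky strata, undersample chit-chat.
STRATUM_RATES = {("de-DE", "refund"): 0.05, ("en-US", "smalltalk"): 0.01}


def should_sample(request_id: str, locale: str, intent: str) -> bool:
    """Decide whether to judge this request, at the rate set for its stratum."""
    rate = STRATUM_RATES.get((locale, intent), DEFAULT_RATE)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a number that is approximately uniform in [0, 1]
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Compared with random.random(), the hash-based decision is reproducible during incident review: you can tell after the fact whether a given request should have been scored.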

from prometheus_client import Histogram, Counter

TTFT = Histogram("llm_ttft_seconds", "Time to first token", ["model", "feature"])
ERRORS = Counter("llm_errors_total", "LLM errors", ["error_class", "feature"])
COST = Counter("llm_cost_usd_total", "Estimated USD", ["feature", "tenant"])

def observe_call(meta: dict):
    TTFT.labels(meta["model"], meta["feature"]).observe(meta["latency_ms"]["ttft"] / 1000)
    COST.labels(meta["feature"], meta["tenant_id"]).inc(meta["cost_usd"])
    if meta.get("error_class"):
        ERRORS.labels(meta["error_class"], meta["feature"]).inc()

meterRegistry.summary("llm.ttft.seconds", "model", model, "feature", feature)
    .record(ttftMs / 1000.0);
meterRegistry.counter("llm.cost.usd.total", "feature", feature, "tenant", tenantId)
    .increment(costUsd);

import random
from faithfulness import nli_entailment  # your wrapper

SAMPLE_RATE = 0.03

def maybe_score_faithfulness(request_id: str, answer: str, chunks: list[str]) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = nli_entailment(answer, chunks)  # 0..1
    traces.log_score(request_id, name="faithfulness", value=score)
    metrics.gauge("llm.faithfulness.sample", score)  # request_id stays on the trace; as a metric tag it is unbounded cardinality

public void maybeScoreFaithfulness(String requestId, String answer, List chunks) {
  if (ThreadLocalRandom.current().nextDouble() > SAMPLE_RATE) return;
  double score = faithfulnessClient.nliEntailment(answer, chunks);
  traces.logScore(requestId, "faithfulness", score);
}

# Sketch of a multi-window burn-rate alert
# SLI: share of requests with TTFT < 800 ms; SLO: 95% over 30d → error budget 5%
# Fast burn: budget spent 14.4x faster than sustainable, on both 1h and 5m windows → page

SLO_TARGET = 0.95
TTFT_SLO_MS = 800

def ttft_burn_rate(window_minutes: int) -> float:
    # metrics.fraction_above is a placeholder for your metrics query layer:
    # the share of requests in the window whose TTFT exceeded the threshold
    bad_fraction = metrics.fraction_above("llm.ttft_ms", TTFT_SLO_MS, window=f"{window_minutes}m")
    return bad_fraction / (1 - SLO_TARGET)

if ttft_burn_rate(60) > 14.4 and ttft_burn_rate(5) > 14.4:
    pager.trigger("LLM TTFT SLO fast burn")

double ttftBurnRate(int windowMinutes) {
  // fractionAbove is a placeholder for your metrics query layer
  double badFraction = metrics.fractionAbove("llm.ttft_ms", 800, windowMinutes);
  return badFraction / (1 - 0.95);  // observed bad share divided by the error budget
}

💡 Tip

Plot tokens_out vs total_latency_ms—linear correlation confirms decode-bound slowness; flat line implicates network or queue.
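The same check can be automated with a Pearson correlation over a window of requests. This is a rough sketch: the 0.7 cutoff is an assumption to tune on your own traffic, and a weak correlation only suggests, rather than proves, a queue or network cause.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def latency_regime(tokens_out: list[float], total_latency_ms: list[float], threshold: float = 0.7) -> str:
    """Label a window as decode-bound when latency tracks output length."""
    r = pearson(tokens_out, total_latency_ms)
    return "decode-bound" if r >= threshold else "queue-or-network"
```

Run it per model and feature; mixing a summarizer and a chat feature in one window blurs the relationship.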

⚠️ Pitfall

Averaging faithfulness scores without volume weighting—a 0.2 score on 2 samples moves the weekly chart as much as 10k samples.
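A short worked example makes the pitfall concrete. The sketch assumes you store one (mean_score, sample_count) pair per day; the numbers in the usage note are made up to mirror the case described above.

```python
def naive_average(daily: list[tuple[float, int]]) -> float:
    """Average of daily means, ignoring how many samples each day had."""
    return sum(score for score, _ in daily) / len(daily)


def weighted_average(daily: list[tuple[float, int]]) -> float:
    """Average over all samples: each day counts in proportion to its volume."""
    total = sum(count for _, count in daily)
    if total == 0:
        raise ValueError("no samples in window")
    return sum(score * count for score, count in daily) / total
```

With daily = [(0.2, 2), (0.9, 10000)], the naive average lands at 0.55 while the weighted average stays just under 0.90, which is the difference between a false alarm and a quiet week.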

🔬 Under the Hood

Histogram buckets for TTFT should start at 50ms, not 1s. Much of a healthy TTFT distribution can sit within a few hundred milliseconds, with a secondary bump from queueing; buckets that start at 1s collapse both into a single bar.

Dashboard row (minimum)

TTFT p50/p95 by model and feature
Error rate by error_class
Daily cost_usd by tenant (top 10)
Faithfulness sample 7-day rolling avg
Guardrail block rate (from Guide 4)

Grafana / Datadog sketch

Export Prometheus metrics from your log pipeline or OTel collector. Typical dashboard rows: TTFT heatmap by model, refusal rate stacked by guardrail dimension, faithfulness 7-day rolling avg, cost/tenant top-N, error class breakdown. Datadog: use gen_ai.* facets if using OpenLLMetry; correlate with APM service map.

⚙️ Config

Link guardrail block-rate panel to Guide 4 dimensions—injection spikes route to SecOps; PII blocks route to privacy oncall. One “block rate” metric hides which rail failed.

Alert thresholds

LLM alerts fall into four buckets: errors (hard failures), latency (SLO burn), cost anomalies (spend without traffic), and quality (judge score drift, escalation spikes). Each needs a runbook link, an owner, and explicit “expected noise” for guardrail blocks during attacks.

Error alerts

Condition	Severity	Runbook action
429 rate >5% 5m	Page	Enable queue; switch fallback model
5xx from provider >1% 5m	Page	Status page; regional failover
Context overflow >10% 15m	Ticket	Trim retrieval; thread summarization
Empty completion >0.5%	Warn	Check filter/harm blocks misfiring

Latency alerts

Use multi-window burn rates on TTFT and total latency. Pair with tokens_out p95: if a latency alert fires without a token spike, suspect a provider incident, not your deploy.

Cost anomaly alerts

Forecast daily spend with 7-day seasonal baseline; alert when actual >2σ or >120% of forecast. Slice by tenant_id to catch one customer’s agent loop, not fleet-wide noise.

Signal	Threshold	Likely cause
Cost/req p99 3× baseline	1 hour	Runaway tool loop, missing max_tokens
Embedding cost 2×	24 h	Re-embed cron duplicated
Judge token line item 5×	24 h	Sample rate config typo (0.3 vs 0.03)

Quality alerts

Quality signals lag—use 24h windows. Alert on faithfulness sample drop >8pts vs 7d baseline, CSAT drop correlated with model version tag, or support escalation keyword rate spike (“wrong policy”, “made up”).

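ALERTS = [
    # Ratio of 429s to all requests (a bare rate() is errors per second, not a percentage);
    # llm_requests_total is an assumed request counter alongside llm_errors_total
    {"name": "llm_429_spike", "expr": "sum(rate(llm_errors_total{error_class='429'}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05", "severity": "page"},
    {"name": "llm_cost_anomaly", "expr": "daily_cost > forecast * 1.2", "severity": "ticket"},
    {"name": "faithfulness_drop", "expr": "faithfulness_7d_avg < baseline - 0.08", "severity": "ticket"},
]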

def route(alert: dict):
    if alert["severity"] == "page":
        pagerduty.trigger(alert["name"], runbook="wiki/llm-incident")
    else:
        slack.post("#llm-ops", alert["name"])

public void route(Alert alert) {
  if ("page".equals(alert.severity())) {
    pagerduty.trigger(alert.name(), "wiki/llm-incident");
  } else {
    slack.post("#llm-ops", alert.name());
  }
}

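import statistics as stats

Z_THRESHOLD = 2.5  # tune per feature; lower values alert sooner but fire more often

history = load_daily_cost(feature="refund_copilot", days=28)  # daily USD totals
today = current_daily_cost()

mu = stats.mean(history)
# pstdev is 0.0 for a perfectly flat history; fall back to 1.0 to avoid dividing by zero
sigma = stats.pstdev(history) or 1.0
z = (today - mu) / sigma  # standard deviations above the 28-day mean

if z > Z_THRESHOLD:
    fire_alert("cost_anomaly", {"z": z, "today_usd": today, "mean_usd": mu})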

double mu = history.stream().mapToDouble(d -> d).average().orElse(0);
double sigma = stdDev(history);
double z = (today - mu) / Math.max(sigma, 1e-6);
if (z > 2.5) fireAlert("cost_anomaly", Map.of("z", z));

def check_quality_regression():
    for version in active_prompt_versions():
        score = metrics.avg("faithfulness", filter={"prompt_version": version}, window="24h")
        baseline = baselines.get(version)
        if score is None or baseline is None:
            continue  # no samples or no stored baseline yet; nothing to compare
        if score < baseline - 0.08:  # 8-point drop
            create_incident(
                title=f"Quality drop on {version}",
                owner="ml-oncall",
                links=[f"langfuse://prompts/{version}"],
            )

void checkQualityRegression() {
  for (String version : activePromptVersions()) {
    double score = metrics.avg("faithfulness", Map.of("prompt_version", version), "24h");
    Double baseline = baselines.get(version);
    if (baseline == null) continue;  // no stored baseline yet; unboxing null would throw
    if (score < baseline - 0.08) createIncident(version, score);
  }
}
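The escalation keyword signal from the quality alerts paragraph has no example above, so here is a small sketch. It measures the share of support tickets containing an escalation phrase and compares it with a baseline rate. The two phrases come from the text; the 2× spike factor and the function names are assumptions.

```python
ESCALATION_PHRASES = ("wrong policy", "made up")


def escalation_rate(tickets: list[str],
                    phrases: tuple[str, ...] = ESCALATION_PHRASES) -> float:
    """Fraction of tickets mentioning at least one escalation phrase."""
    if not tickets:
        return 0.0
    hits = sum(any(p in text.lower() for p in phrases) for text in tickets)
    return hits / len(tickets)


def is_spike(current: float, baseline: float, factor: float = 2.0) -> bool:
    """True when the current rate is non-zero and above factor x baseline."""
    return current > 0 and current > factor * baseline
```

Substring matching is crude; it is meant as a cheap trigger for opening trace exemplars, not as a quality score in its own right.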

🔒 Security

Wire guardrail block spikes to SecOps when dimension includes injection or PII—not all blocks belong in #llm-ops noise.

📦 Real World

Cost anomaly caught a deploy that set max_tool_iterations=50 default—support bot burned $12k overnight calling search in a loop.

🎯 Interview Tip

"What do you alert on for LLM prod?" — Typed errors, TTFT SLO burn, cost/req anomalies, sampled quality drift—not BLEU on live traffic.

Runbook template

Acknowledge; check provider status and recent deploys
Compare TTFT vs tokens_out; open trace exemplars
Roll back prompt_version or enable fallback model
Post-mortem: add failing trace to golden set (Guide 6 CI)

Online evals: sample, score, promote to gold

Run online evals on 1–5% of production traffic: shadow scoring with rules and lightweight judges, a human review queue for borderline cases, automatic promotion of failures to the golden dataset, and week-over-week drift detection.

Offline CI cannot cover every user phrasing, locale, or adversarial pattern. Online eval closes the gap without bankrupting inference—sample stratified traffic, score asynchronously, and feed failures back to gold within days.

Sample 1–5% production traffic

Strategy	Rate	When
Head-based random	1–3%	Steady-state chat
Stratified by locale/intent	5% on thin slices	Bias audit, new market launch
100% on error/guardrail block	All failures	Always—cheap signal
100% staging	All	Pre-prod prompt experiments
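One way to implement the table above is a deterministic, hash-based sampler: the same request_id always gets the same decision, which keeps shadow scoring and tracing aligned across services. The rates mirror the table; the log field names (`env`, `error_class`, `guardrail_blocked`, `locale`) and the function names are assumptions.

```python
import hashlib

# Assumed per-stratum rates: 2% steady state, 5% on thin slices.
STRATUM_RATES = {"default": 0.02, "thin_slice": 0.05}


def sample_bucket(request_id: str) -> float:
    """Map a request_id to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_sample(log: dict, thin_slices: set[str]) -> bool:
    """Apply the sampling table: always keep failures and staging traffic."""
    if log.get("env") == "staging":
        return True
    if log.get("error_class") or log.get("guardrail_blocked"):
        return True
    stratum = "thin_slice" if log.get("locale") in thin_slices else "default"
    return sample_bucket(log["request_id"]) < STRATUM_RATES[stratum]
```

Hashing the request_id instead of calling a random generator also makes sampling decisions reproducible during incident replay.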

Shadow scoring with rules

Shadow path runs after the user sees the response—never block on online judge latency. Apply cheap rules first: regex policy phrases, citation ID match against retrieval_ids, JSON schema re-validation, max length. Escalate to LLM-as-judge only when rules are inconclusive.

import random
from queue import Queue

shadow_queue: Queue = Queue()

def maybe_enqueue_shadow(log: dict) -> None:
    if log.get("guardrail_decisions") and any(d["action"] == "block" for d in log["guardrail_decisions"]):
        shadow_queue.put(log)  # 100% on blocks
        return
    if random.random() < 0.03:  # 3% sample
        shadow_queue.put(log)

def shadow_worker():
    while True:
        log = shadow_queue.get()
        rules = run_rule_checks(log)  # citation match, banned phrases
        if rules["passed"] is None:  # inconclusive
            rules["judge_score"] = llm_judge(log)
        store_online_eval_result(log["request_id"], rules)

void maybeEnqueueShadow(LlmCallLog log) {
  boolean blocked = log.guardrailDecisions().stream()
      .anyMatch(d -> "block".equals(d.get("action")));
  if (blocked || ThreadLocalRandom.current().nextDouble() < 0.03) {
    shadowQueue.offer(log);
  }
}
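The `run_rule_checks` helper called in the shadow worker above is not defined in this guide. The sketch below shows one shape it could take, returning `passed` as True, False, or None (inconclusive) so the worker can decide whether to call the judge. The `[doc:ID]` citation format, the banned phrases, the length limit, and the `output` field name are all assumptions; JSON schema re-validation is omitted for brevity.

```python
import re

# Hypothetical policy phrases the bot must never emit.
BANNED = re.compile(r"\b(guaranteed refund|legal advice)\b", re.IGNORECASE)
# Assumed citation format: [doc:some-id]
CITATION = re.compile(r"\[doc:([\w-]+)\]")
MAX_CHARS = 2000


def run_rule_checks(log: dict) -> dict:
    """Cheap deterministic checks; passed=None means rules cannot decide."""
    answer = log.get("output", "")
    reasons = []

    if BANNED.search(answer):
        reasons.append("banned_phrase")
    if len(answer) > MAX_CHARS:
        reasons.append("too_long")

    cited = set(CITATION.findall(answer))
    retrieved = set(log.get("retrieval_ids", []))
    if cited - retrieved:
        # The answer cites a document that retrieval never returned.
        reasons.append("citation_not_retrieved")

    if reasons:
        return {"passed": False, "reasons": reasons}
    if cited:
        return {"passed": True, "reasons": []}
    # No violations but nothing to verify against: let the judge decide.
    return {"passed": None, "reasons": ["no_citations"]}
```

Returning a tri-state result keeps the expensive judge call limited to the cases the rules cannot settle.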

Human review queue

Route low-confidence shadow scores (judge 0.4–0.6, rule conflicts, user thumbs-down) to a review UI in LangSmith or Langfuse. Reviewers label pass/fail with rubric notes; one click promotes to golden JSONL with source=production and trace_id link.
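The routing rule in this paragraph is small enough to state as code. A minimal sketch, assuming the shadow result dict carries an optional `judge_score` and a `rule_conflict` flag (both field names are hypothetical):

```python
BORDERLINE_LOW, BORDERLINE_HIGH = 0.4, 0.6


def needs_human_review(result: dict, thumbs_down: bool = False) -> bool:
    """Send borderline, conflicting, or user-flagged responses to reviewers."""
    score = result.get("judge_score")
    borderline = score is not None and BORDERLINE_LOW <= score <= BORDERLINE_HIGH
    return borderline or bool(result.get("rule_conflict")) or thumbs_down
```

Clear passes and clear fails skip the queue, so reviewer time goes to the cases where automated signals disagree or hesitate.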

Promote failures to golden dataset

Shadow score fails or human marks fail
Export row: input, expected policy, actual output, retrieval_ids, prompt_version
PR to evals/gold/ with reviewer sign-off
CI runs on merge—failure becomes regression gate in Guide 6
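The export step above can be sketched as a function that builds one golden row from a production log and appends it to a JSONL file. The row fields follow the list above plus the `source` and `trace_id` tags mentioned for the review queue; the exact key names, the log field names, and the file layout under `evals/gold/` are assumptions.

```python
import json
from pathlib import Path


def to_gold_row(log: dict, expected_policy: str) -> dict:
    """Build a golden-set row from a failed production log."""
    return {
        "input": log["input"],
        "expected_policy": expected_policy,  # filled in by the reviewer
        "actual_output": log["output"],
        "retrieval_ids": log.get("retrieval_ids", []),
        "prompt_version": log["prompt_version"],
        "source": "production",
        "trace_id": log["trace_id"],
    }


def append_gold(row: dict, path: Path) -> None:
    """Append one row as a single JSON line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Appending through a reviewed PR rather than writing straight to the main branch preserves the sign-off step; this function would run on the branch that backs that PR.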

Drift detection week-over-week

Compare weekly aggregates: faithfulness sample mean, refusal rate by dimension, embedding centroid similarity of answers, CSAT on bot-handled tickets. Alert when any metric moves >2σ vs four-week baseline—not on single-day noise.

Drift signal	Window	Action
Faithfulness −8pts	7d vs prior 7d	Freeze prompt merges; open incident
Refusal rate 2×	24h	Check attack traffic vs false positives
Answer embedding centroid	Weekly	Index rot or model swap investigation
Escalation keyword rate	7d	Product review; add gold rows
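The 2σ rule against a four-week baseline might be implemented along these lines. The function name and input shape (one aggregate per prior week, plus the current week's value) are assumptions for illustration.

```python
from statistics import mean, stdev


def drifted(baseline_weeks: list[float], current_week: float,
            z_threshold: float = 2.0) -> bool:
    """True when the current weekly value is more than z_threshold sample
    standard deviations away from the baseline mean, in either direction."""
    if len(baseline_weeks) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_weeks)
    sigma = stdev(baseline_weeks)
    if sigma == 0:
        # Perfectly flat baseline: any change at all counts as movement.
        return current_week != mu
    return abs(current_week - mu) / sigma > z_threshold
```

With only four baseline points the standard deviation estimate is itself noisy, so it may be worth pairing this check with an absolute floor such as the 8-point faithfulness drop in the table.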

💡 Tip

Tag every online eval row with prompt_version and model_resolved—drill-down without guessing which deploy caused drift.

⚠️ Pitfall

100% online LLM-as-judge in prod—doubles inference cost and adds judge bias. Rules first, judge on sample, human on borderline.

🎯 Interview Tip

"Online vs offline eval?" — Offline blocks bad merges; online detects drift on live traffic; failures promote to gold; closed loop in days not quarters.

🔬 Under the Hood

Shadow queue should be durable (SQS/Kafka)—dropping samples during traffic spikes hides the exact failures you need most.

Track 5 observability production checklist

Ship when structured logs, traces, dashboards, and alerts pass the Track 5 observability checklist—then connect exemplar failures to CI eval in the next guide.

Guide 5 closes when traces, dashboards, and alerts match the Track 5 observability checklist below—then wire regression gates in Guide 6 CI. Observability without CI regression is reactive; CI without production telemetry is blind to index rot and provider drift.

Track 5 observability checklist

Item	Done when	Owner
Canonical LlmCallLog schema in git	Code review enforced	Platform
Trace tool chosen + OTel export	Staging 100% sampled	Platform
TTFT + total latency dashboards	Per feature SLO lines drawn	SRE
Cost per tenant/feature	Daily chart + anomaly alert	FinOps + ML
Faithfulness 3% sample	7d baseline stored	ML
Guardrail fields joined on trace	From Guide 4 checklist	Safety
Incident runbook linked in alerts	On-call drill completed	SRE
PII redaction verified	Sec review sign-off	Security

Pre-CI handoff fields

Export these to your eval CI pipeline (next guide):

prompt_version, model_resolved
faithfulness.sample weekly avg
cost_usd per request for budget regression tests
Exemplar trace_id links on failed judge scores

CHECKS = {
    "schema_tests_pass": run_pytest("tests/test_llm_log_schema.py"),
    # Prod head sampling should stay at or below 10%; staging is sampled at 100%.
    "prod_sample_rate": env_float("PROD_TRACE_SAMPLE") <= 0.1,
    "dashboard_exists": grafana.api.dashboard_uid("llm-ttft") is not None,
    "alert_linked": pagerduty.has_runbook("llm_429_spike"),
}

def observability_ready() -> bool:
    failed = [k for k, v in CHECKS.items() if not v]
    if failed:
        print("Blocked:", failed)
        return False
    return True

Map<String, Supplier<Boolean>> checks = Map.of(
    "schema_tests_pass", () -> pytest.run("tests/LlmLogSchemaTest"),
    "prod_sample_rate", () -> prodTraceSample <= 0.1f
);
boolean observabilityReady() {
  return checks.entrySet().stream().allMatch(e -> e.getValue().get());
}

AGENDA = [
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend",
    "New error classes this week",
    "Traces promoted to golden set",
]
# 30min ML + SRE + product — outputs tickets, not slides

List<String> AGENDA = List.of(
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend"
);

📦 End of Guide 5

You can explain LLM-specific observability gaps, log a canonical schema, pick tracing tooling, define SLIs, and alert with runbooks. Guide 6 adds CI evaluation—regression gates that block bad prompt merges before they reach the traffic you just instrumented.

🔒 Security

Checklist: verify prod logs contain zero raw PAN/SSN samples in weekly automated scan; alert if regex hits increase.
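A weekly scan of this kind could start from the sketch below. It flags log lines containing something shaped like a US SSN or a card number (PAN), using a Luhn checksum to cut false positives on long digit runs. The regexes are deliberately simple assumptions and will miss unusual formats; they are a starting point, not a compliance control.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_line(line: str) -> list[str]:
    """Return the PII types found in one log line (possibly empty)."""
    hits = []
    if SSN_RE.search(line):
        hits.append("ssn")
    for match in PAN_RE.finditer(line):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            hits.append("pan")
            break
    return hits
```

Counting hits per day and alerting on an increase matches the checklist item; the matched values themselves should never be written to the scan report.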

📦 Real World

Teams that complete this checklist before CI eval report 40% faster root-cause on prompt regressions—trace exemplars become golden rows automatically.

🎯 Interview Tip

"Observability stack for LLMs?" — Schema per call, trace waterfall, TTFT/cost SLIs, sampled judges, anomaly alerts, closed loop to eval datasets.

💰 Cost

Budget 1–3% of inference spend for observability SaaS + storage—cheaper than one undetected runaway agent weekend.

🔬 Under the Hood

Trace retention 7–14d hot, 90d cold object storage—long enough for weekly quality review, not a compliance lake by accident.

⚖️ Trade-off

100% tracing gives perfect forensics but doubles cost at scale—10% head-based sampling plus 100% on errors is the usual compromise.

⚠️ Pitfall

Shipping dashboards without alert runbooks—on-call acks pages then closes them; drift continues until CSAT collapses.

Track 5 guide sequence

Guide	Topic	You are here
Guide 4	Safety & guardrails	← prev
Guide 5	Observability & monitoring	✓
Guide 6	Continuous evaluation in CI	next →

FAQ: Sample rate for traces?

100% staging; prod 5–10% head sample plus 100% on error and guardrail block. Increase temporarily during launches.

FAQ: Who owns quality alerts?

ML oncall triages; product approves prompt rollback; platform owns latency/error pages.