
Observability & monitoring

Offline golden evals prove your bot passed yesterday’s test suite. Observability proves it is still safe, faithful, and affordable in production today. This guide closes the eval loop: structured per-request logs, LangSmith and Langfuse traces, Helicone cost proxies, SLI dashboards tied to guardrail block rates, online eval sampling, drift detection, and the Track 5 observability checklist.

After reading, you should be able to: explain why offline evals are necessary but insufficient; emit a production-safe JSON log schema per LLM request; integrate LangSmith or Langfuse for traces and feedback; route inference through Helicone for cost and latency telemetry; define SLIs and alert thresholds for refusal rate, faithfulness, latency, errors, and cost; sample 1–5% of traffic for shadow scoring; promote production failures to golden datasets; and ship with the Track 5 observability production checklist.


Observability explained: why offline evals are not enough

Offline evals answer “did we pass the golden set we wrote?” Online observability answers “is production still healthy on traffic we never imagined?” This section contrasts offline vs online evaluation, maps the three pillars for LLM apps—traces, metrics, logs—and closes the eval loop from production failures back to gold.

Your refund copilot passes 200 golden cases in CI on Tuesday. Wednesday morning support volume doubles and answer quality craters, even though the deploy, the prompt hash, and the model string in config are all unchanged. Traditional APM shows green p99 latency and zero 5xx errors because every request returned HTTP 200 with a plausible paragraph. LLM observability exists because correctness is probabilistic, spend is per-token, and failure modes hide inside successful responses. Offline evals are necessary; they are not sufficient.

Offline vs online evaluation

Dimension | Offline eval (CI / nightly) | Online observability (prod)
Traffic | Frozen golden JSONL | Live user queries, seasonal spikes, attacks
Latency | Batch, unbounded wall time | TTFT p95, queue depth, user-facing SLOs
Cost | Budgeted job; capped N cases | Per-request token spend; runaway agent loops
Correctness | Pass rate vs rubric | Sampled judges, thumbs, escalations, drift
Feedback loop | Human updates gold before merge | Auto-promote failures → gold within days
Guardrails | Adversarial gold in CI | Block-rate dashboards, injection spikes

Three pillars for LLM apps

Standard observability stacks still apply—but LLM apps need pillar-specific fields. Traces waterfall retrieval, tool calls, guardrail stages, and generation spans under one trace_id. Metrics aggregate refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request. Logs emit structured JSON per request (next section) for audit, billing reconciliation, and incident replay without storing raw PII.

Pillar | LLM-specific signal | Example tool
Traces | Span per retrieve / tool / generate / guardrail | LangSmith, Langfuse, OTel
Metrics | TTFT p95, tokens_in/out, cost_usd, block_rate | Prometheus, Datadog, Grafana
Logs | prompt_version, retrieval_ids, guardrail_decisions | JSON → Loki / CloudWatch / ELK

flowchart LR
  subgraph offline["Offline eval"]
    G[Golden JSONL] --> CI[CI / nightly suite]
    CI -->|pass| Ship[Deploy]
  end
  subgraph online["Online observability"]
    Prod[Production traffic] --> Log[Structured logs + traces]
    Log --> Dash[Dashboards + alerts]
    Dash --> Sample[Sample 1-5% shadow scoring]
  end
  Sample -->|failure| Review[Human review queue]
  Review --> G
  Dash -->|drift alert| Rollback[Prompt rollback / feature flag]

Closing the eval loop: production → gold

Mature teams treat production surprises as dataset candidates, not one-off tickets. A faithfulness alert fires → on-call opens trace exemplars → PM confirms bad answer → row added to gold/refund_policy_v3.jsonl within 48h → CI blocks the next regression. Observability without this loop produces pretty dashboards that never change prompts.
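
The promotion step in that loop can be sketched as a small helper. This is a minimal illustration, not a prescribed pipeline: the row fields (`id`, `input`, `expected`, `source`) and the `query_redacted` trace key are assumptions, and the expected answer is supplied by the human reviewer.

```python
import json
from pathlib import Path


def promote_to_gold(trace: dict, expected_answer: str, gold_path: Path) -> dict:
    """Append one human-reviewed production failure to a golden JSONL file."""
    row = {
        "id": f"prod-{trace['request_id']}",
        "input": trace["query_redacted"],  # assumed key: already PII-scrubbed text
        "retrieval_ids": trace.get("retrieval_ids", []),
        "expected": expected_answer,
        "source": "production_failure",
        "prompt_version": trace.get("prompt_version"),
    }
    gold_path.parent.mkdir(parents=True, exist_ok=True)
    # Append-only so earlier golden rows are never rewritten.
    with gold_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row
```

Keeping `source` on every promoted row makes it easy to report later how much of the golden set came from production rather than from hand-written cases.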

Non-deterministic outputs

Identical messages[], temperature 0, and model ID can still diverge across provider patches, batching order, and floating-point nondeterminism on some GPUs. You cannot assert response == expected in production—you score distributions, sample traces, and compare embedding centroids week over week. Offline evals give point estimates; online observability tracks drift of those estimates on live traffic slices.

  • Same input, different answer — user retries after a vague reply and gets contradictory policy guidance.
  • Silent model swap — provider routes gpt-4o-2024-08-06 to a newer snapshot without a version bump in your config.
  • Tool nondeterminism — agent calls search twice; second result set changes the final refund decision.
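
One way to surface the "same input, different answer" case is to group log rows by what the model saw and count distinct response hashes. A minimal sketch, assuming each row carries `messages_hash`, `model`, `prompt_version`, and `response_hash` fields:

```python
from collections import defaultdict


def find_divergent_prompts(rows: list[dict]) -> dict[tuple, set]:
    """Return inputs that produced more than one distinct response hash."""
    seen: dict[tuple, set] = defaultdict(set)
    for row in rows:
        # Identical input = same prompt hash, model, and prompt version.
        key = (row["messages_hash"], row["model"], row["prompt_version"])
        seen[key].add(row["response_hash"])
    return {key: hashes for key, hashes in seen.items() if len(hashes) > 1}
```

Divergence is expected at temperature above 0, so this is most useful as a trend (share of repeated inputs that diverge) rather than as a per-request alert.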

Latency variance (TTFT vs total)

LLM latency has two distinct components: time-to-first-token (TTFT) reflects queue depth and prompt size, while total latency adds a decode phase whose length varies with the size of the answer. A p99 total latency alert can fire during a marketing push not because GPUs failed but because users ask longer questions and models emit longer answers. Split SLIs into ttft_ms, decode_ms, and tokens_out; otherwise you chase the wrong knob (scaling replicas when you need max_tokens caps).

Signal | Traditional API | LLM call | What to chart
Success | HTTP 2xx | HTTP 2xx + empty/refusal/hallucination | Semantic pass rate sample
Latency | Single wall time | TTFT + decode + tool round-trips | TTFT p50/p95, total p95
Payload | Fixed JSON schema | Variable tokens in/out | Tokens per feature, context fill ratio
Errors | Exceptions, 5xx | Rate limits, context overflow, policy block | Typed error codes per vendor
Correctness | Deterministic tests | Judge scores, user thumbs, task success | Weekly drift vs golden centroid

Cost unpredictability

A “simple” FAQ bot becomes expensive when retrieval returns 40 chunks, the model cites each, and a follow-up rewrites the whole thread into context. Cost scales with tokens in + tokens out + tool calls + judge evals, not QPS alone. Finance sees a 3× bill spike while traffic is flat—you need per-feature, per-tenant, per-model attribution in logs, not an aggregate CloudWatch billing alarm two days late.

Cost drivers to attribute per request

Driver | Log field | Typical surprise
Prompt bloat | tokens_in | System prompt duplicated per turn
Runaway decode | tokens_out, finish_reason | Missing max_tokens on tool loop
Retry storm | attempt | 429 backoff not capped
Judge tax | eval_judge_tokens | 100% online LLM-as-judge in prod
Embedding refresh | embedding_tokens | Re-embed on every query
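
Attribution only works if the log rows can be rolled up by slice. A minimal sketch of that roll-up, assuming each row has `feature`, `tenant_id`, and `cost_usd` fields as in the schema later in this guide:

```python
from collections import defaultdict


def cost_by_slice(rows: list[dict], keys: tuple = ("feature", "tenant_id")) -> dict:
    """Sum cost_usd per slice and return slices ordered from most to least expensive."""
    totals: dict[tuple, float] = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Passing `keys=("feature", "model_resolved")` to the same function gives per-model attribution without another code path.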

Quality drift without deploys

Index rot, competitor docs changing, seasonal intents, and adversarial prompt patterns shift quality while git SHA is frozen. RAG retrieval recall drops when a PDF update splits tables across chunks—answers sound confident and wrong. Observability connects retrieval metrics (hit rate, MRR@k), generation metrics (faithfulness sample), and product metrics (escalation rate, CSAT) on shared request_id so you know which layer drifted.
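
The two retrieval metrics named above can be computed from logged chunk IDs and a labeled set of relevant chunks. A minimal sketch, assuming `retrieved` holds ranked chunk IDs per query and `relevant` holds the set of correct chunk IDs for the same queries:

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries with at least one relevant chunk in the top k."""
    if not retrieved:
        return 0.0
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel & set(ids[:k]))
    return hits / len(retrieved)


def mrr_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant chunk within the top k (0 if none)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids[:k], start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Tracking both on the same sampled queries helps separate "the right chunk disappeared" (hit rate drops) from "the right chunk slid down the ranking" (MRR drops while hit rate holds).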

Capture nondeterminism metadata on every completion
import hashlib
import logging
from dataclasses import dataclass, asdict

logger = logging.getLogger("llm")

@dataclass
class LlmCallMeta:
    request_id: str
    model: str
    provider: str
    prompt_version: str
    temperature: float
    seed: int | None
    tokens_in: int
    tokens_out: int
    finish_reason: str
    response_hash: str  # sha256 of normalized text

def normalize(text: str) -> str:
    return " ".join(text.split()).lower()

def record_completion(req_id: str, model: str, text: str, usage, **kwargs) -> LlmCallMeta:
    meta = LlmCallMeta(
        request_id=req_id,
        model=model,
        provider=kwargs.get("provider", "openai"),
        prompt_version=kwargs["prompt_version"],
        temperature=kwargs.get("temperature", 0.0),
        seed=kwargs.get("seed"),
        tokens_in=usage.prompt_tokens,
        tokens_out=usage.completion_tokens,
        finish_reason=kwargs.get("finish_reason", "stop"),
        response_hash=hashlib.sha256(normalize(text).encode()).hexdigest()[:16],
    )
    logger.info("llm_completion", extra=asdict(meta))
    return meta
public record LlmCallMeta(
    String requestId, String model, String provider, String promptVersion,
    double temperature, Integer seed, int tokensIn, int tokensOut,
    String finishReason, String responseHash) {}

public LlmCallMeta recordCompletion(String reqId, String model, String text, Usage usage, Map kwargs) {
  String hash = sha256(normalize(text)).substring(0, 16);
  var meta = new LlmCallMeta(reqId, model, (String) kwargs.getOrDefault("provider", "openai"),
      (String) kwargs.get("prompt_version"), (double) kwargs.getOrDefault("temperature", 0.0),
      (Integer) kwargs.get("seed"), usage.promptTokens(), usage.completionTokens(),
      (String) kwargs.getOrDefault("finish_reason", "stop"), hash);
  structuredLog.info("llm_completion", meta);
  return meta;
}
Split latency timers: TTFT vs decode
import time
from contextlib import contextmanager

@contextmanager
def llm_stream_timers(request_id: str):
    # `metrics` is assumed to be a statsd-style client configured elsewhere.
    marks = {"start": time.perf_counter(), "first_token": None}
    try:
        yield marks
    finally:
        # Emit timings even if the stream raises mid-response.
        total_ms = (time.perf_counter() - marks["start"]) * 1000
        ttft_ms = (marks["first_token"] - marks["start"]) * 1000 if marks["first_token"] else total_ms
        decode_ms = total_ms - ttft_ms
        metrics.timing("llm.ttft_ms", ttft_ms, tags=[f"req:{request_id[:8]}"])
        metrics.timing("llm.decode_ms", decode_ms)
        metrics.timing("llm.total_ms", total_ms)

# In streaming handler:
# marks["first_token"] = time.perf_counter()  # on first SSE chunk
public class LlmStreamTimers implements AutoCloseable {
  private final long start = System.nanoTime();
  private Long firstTokenNs = null;

  public void onFirstToken() {
    if (firstTokenNs == null) firstTokenNs = System.nanoTime();
  }

  @Override
  public void close() {
    long totalMs = (System.nanoTime() - start) / 1_000_000;
    long ttftMs = firstTokenNs == null ? totalMs : (firstTokenNs - start) / 1_000_000;
    metrics.timing("llm.ttft_ms", ttftMs);
    metrics.timing("llm.decode_ms", totalMs - ttftMs);
    metrics.timing("llm.total_ms", totalMs);
  }
}
Weekly quality drift detector (embedding centroid)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class DriftMonitor:
    def __init__(self, baseline_vectors: np.ndarray, threshold: float = 0.92):
        self.centroid = baseline_vectors.mean(axis=0)
        self.threshold = threshold

    def score_batch(self, live_vectors: np.ndarray) -> dict:
        live_centroid = live_vectors.mean(axis=0)
        sim = float(cosine_similarity([self.centroid], [live_centroid])[0, 0])
        return {"centroid_similarity": sim, "drift_alert": sim < self.threshold}

# Run nightly on sampled answer embeddings; page if drift_alert.
public record DriftScore(double centroidSimilarity, boolean driftAlert) {}

public DriftScore scoreBatch(double[][] liveVectors, double[] baselineCentroid, double threshold) {
  double[] liveCentroid = meanRows(liveVectors);
  double sim = cosineSimilarity(baselineCentroid, liveCentroid);
  return new DriftScore(sim, sim < threshold);
}
⚠️ Pitfall

Alerting on exact string match for regression—flaky and meaningless with temperature > 0. Use semantic similarity bands, judge scores, or task-success labels instead.

🔬 Under the Hood

Provider TTFT includes queue time in shared multi-tenant clusters; your p95 TTFT spike may be vendor-side congestion, not your prompt doubling. Always log provider_request_id for support tickets.

🎯 Interview Tip

"Why is LLM observability hard?" — Non-deterministic correctness, token-based cost, TTFT+decode latency split, silent model/index drift, failures masked as 200 OK. Mention trace spans per tool call and sampling for judges.

FAQ: Do we still need Datadog?

Yes—LLM traces complement APM. Keep JVM/Python service metrics in Datadog; export LLM spans to Langfuse or OpenTelemetry with gen_ai.* semantic conventions. Correlate via trace_id.

FAQ: Is temperature 0 enough for stability?

It reduces variance; it does not guarantee bitwise-identical outputs or protect against retrieval/index drift. Still log hashes and run online samples.

Logging schema: structured JSON per request

Structured logging is the foundation of LLM observability. Each completion emits one JSON row with trace_id, request_id, model, prompt_version, latency_ms, token counts, retrieval_ids, tool_calls, guardrail_decisions, and cost_usd—never raw PII.

Every LLM call—chat, embed, rerank, judge, moderation—should emit one structured log row and one trace span with the same request_id. If you cannot replay what the model saw, you cannot debug hallucinations, cost spikes, or guardrail false positives. Treat logs as a contract between app eng, ML, and platform: schema in git, enforced in code review, validated in staging.

Canonical fields per LLM call

Field | Required? | Example | Why
request_id | Yes | req_8f2a | Join app, RAG, guardrails, support ticket
trace_id | Yes | OpenTelemetry hex | Cross-service waterfall
messages or hash | Yes | Redacted JSON or prompt_sha256 | Replay without storing raw PII
model | Yes | gpt-4o-2024-08-06 | Detect silent provider swaps
tokens_in/out | Yes | 1842 / 412 | Cost + context pressure
latency_ms | Yes | ttft: 420, total: 3100 | SLO dashboards
cost_usd | Yes | 0.0184 | Per-feature chargeback
retrieval_ids | Yes for RAG | ["chk_91","chk_44"] | Debug wrong citations
tool_calls | Yes for agents | [{name, args_hash, latency_ms}] | Loop detection
guardrail_decisions | Yes | [{stage, action, dimension}] | Join Guide 4 block rates
metadata | Yes | tenant, feature, prompt_version | Slice drift alerts
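
Validating the contract in staging can be as simple as checking each emitted row for the required fields. The sketch below is one possible check, not a fixed standard: the field names follow the table, and the `messages_redacted` / `messages_hash` pair is an assumed encoding of the "messages or hash" row.

```python
from typing import Any

ALWAYS_REQUIRED = (
    "request_id", "trace_id", "model", "tokens_in", "tokens_out",
    "latency_ms", "cost_usd", "guardrail_decisions", "metadata",
)


def missing_fields(row: dict[str, Any], *, is_rag: bool = False, is_agent: bool = False) -> list[str]:
    """Return the names of required fields absent from one log row."""
    missing = [field for field in ALWAYS_REQUIRED if field not in row]
    # Either redacted messages or a hash must be present so the call can be replayed.
    if "messages_redacted" not in row and "messages_hash" not in row:
        missing.append("messages_redacted|messages_hash")
    if is_rag and "retrieval_ids" not in row:
        missing.append("retrieval_ids")
    if is_agent and "tool_calls" not in row:
        missing.append("tool_calls")
    return missing
```

Running this against a sample of staging logs in CI turns schema drift into a failed build instead of a surprise during an incident.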

Messages: store vs hash

Production rarely logs full prompts with customer PII. Pattern: log messages_redacted (Presidio scrubbed) in prod; log full payload in staging; always log prompt_version, system_prompt_hash, and context_chunk_ids[] so you can reconstruct from versioned stores when investigating incidents.

Redaction levels

Environment | User content | Assistant content | Retrieval
Production | Mask PII; truncate >2k chars | Store hash + first 200 chars | chunk_id + score only
Staging | Full with synthetic data only | Full | Full chunk text
Debug (sampled) | Encrypted blob, 0.1% sample | Same | Same
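
The production row of that table can be sketched in a few lines. This is an illustration only: the email regex is a stand-in for a real PII scrubber such as Presidio, and the function names are hypothetical.

```python
import hashlib
import re

# Stand-in for a real PII detector; it only catches email-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MAX_USER_CHARS = 2000   # "truncate >2k chars"
PREVIEW_CHARS = 200     # "hash + first 200 chars"


def mask_pii(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL>", text)


def redact_user_content(text: str) -> str:
    """Mask PII first, then truncate, so a cut never exposes half of a masked value."""
    return mask_pii(text)[:MAX_USER_CHARS]


def summarize_assistant_content(text: str) -> dict[str, str]:
    """Store a hash of the full answer plus a short masked preview."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "preview": mask_pii(text)[:PREVIEW_CHARS],
    }
```

The hash is taken over the unmasked answer so that two identical responses always collide, while only the masked preview is human-readable in the log.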

Model and provider metadata

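Log both model_requested and model_resolved after gateway routing. Managed platforms add a layer of indirection: Bedrock resolves friendly names to model IDs or ARNs, and Azure OpenAI routes by deployment name rather than model name. Include provider_request_id for vendor support escalations during outage windows.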

Tokens, latency, and cost

Compute cost_usd in-app from a versioned price table; billing exports arrive too late to alert on. When using prompt caching, break out cached_tokens and price them at the cached rate so marginal-cost dashboards are not overstated.

Structured log schema (Pydantic)
import logging
from typing import Any

from pydantic import BaseModel, Field

logger = logging.getLogger("llm")

class LlmCallLog(BaseModel):
    request_id: str
    trace_id: str
    feature: str
    tenant_id: str
    prompt_version: str
    model_requested: str
    model_resolved: str
    provider: str
    tokens_in: int
    tokens_out: int
    cached_tokens: int = 0
    cost_usd: float
    latency_ms: dict[str, float]  # ttft, total, tools
    finish_reason: str
    messages_hash: str
    retrieval_ids: list[str] = Field(default_factory=list)
    tool_calls: list[dict[str, Any]] = Field(default_factory=list)
    guardrail_decisions: list[dict[str, str]] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)

def emit_log(log: LlmCallLog) -> None:
    # Never log full PII — messages_hash + redacted preview only
    logger.info("llm_call", extra=log.model_dump())
public record LlmCallLog(
    String requestId, String traceId, String feature, String tenantId,
    String promptVersion, String modelRequested, String modelResolved,
    String provider, int tokensIn, int tokensOut, int cachedTokens,
    double costUsd, Map latencyMs, String finishReason,
    String messagesHash, Map metadata) {}

public void emitLog(LlmCallLog log) {
  // Never log full PII — messagesHash + redacted preview only
  structuredLog.info("llm_call", log);
}
OpenTelemetry GenAI semantic attributes
from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")

def span_llm_call(model: str, messages_hash: str):
    return tracer.start_as_current_span(
        "chat.completions",
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.prompt.hash": messages_hash,  # custom attribute, not part of the GenAI conventions
            "feature": "refund_copilot",
        },
    )

# After response:
# span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
# span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
public Span spanLlmCall(String model, String messagesHash) {
  return tracer.spanBuilder("chat.completions")
      .setAttribute("gen_ai.system", "openai")
      .setAttribute("gen_ai.request.model", model)
      .setAttribute("gen_ai.prompt.hash", messagesHash)
      .setAttribute("feature", "refund_copilot")
      .startSpan();
}
Cost calculator from versioned price table
PRICES = {
    "gpt-4o-2024-08-06": {"in": 2.50, "out": 10.00},  # USD per 1M tokens
    "text-embedding-3-small": {"in": 0.02, "out": 0.0},
}

def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

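def enrich_with_cost(log: dict) -> dict:
    # Price the model that actually served the request; fall back to "model" for older rows.
    model = log.get("model_resolved") or log["model"]
    log["cost_usd"] = round(cost_usd(model, log["tokens_in"], log["tokens_out"]), 6)
    return log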
public double costUsd(String model, int tokensIn, int tokensOut) {
  var p = priceTable.get(model);
  return (tokensIn * p.inputPerMillion() + tokensOut * p.outputPerMillion()) / 1_000_000.0;
}
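
The calculators above ignore cached_tokens. A possible extension is sketched below under two stated assumptions: `tokens_in` already includes the cached tokens, and the price table (model name and rates are hypothetical placeholders) carries a separate cached-input rate.

```python
# Hypothetical prices in USD per 1M tokens; replace with your versioned table.
PRICES_PER_MILLION = {
    "example-model": {"in": 2.00, "cached_in": 1.00, "out": 8.00},
}


def cost_usd_with_cache(
    model: str,
    tokens_in: int,
    cached_tokens: int,
    tokens_out: int,
    prices: dict = PRICES_PER_MILLION,
) -> float:
    """Price uncached input, cached input, and output tokens at their own rates."""
    if cached_tokens > tokens_in:
        raise ValueError("cached_tokens cannot exceed tokens_in")
    p = prices[model]
    uncached = tokens_in - cached_tokens
    total = uncached * p["in"] + cached_tokens * p["cached_in"] + tokens_out * p["out"]
    return total / 1_000_000
```

If your provider reports cached tokens separately from input tokens, drop the subtraction and pass the uncached count directly.
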
💡 Tip

Standardize on one metadata.feature enum—refund_copilot, ticket_summarizer—before you have 40 microservices logging service_name inconsistently.

🔒 Security

Never log API keys, raw OAuth tokens, or full PAN/SSN—even in ‘temporary’ debug flags. Use hash + encrypted object storage with TTL for sampled full prompts.

⚖️ Trade-off

Full message logging aids debug but increases GDPR surface and storage cost. Hash + versioned prompt registry + chunk IDs is usually enough to reconstruct most incidents at a fraction of the risk.

Metadata per call: minimum viable tags

  • tenant_id — multi-tenant cost and quality isolation
  • prompt_version — git tag or registry semver
  • retrieval_index_version — vector store snapshot
  • guardrail_actions[] — from Guide 4 traces
  • user_locale — slice drift detection

FAQ: Log embeddings?

Log embedding_model, tokens_in, and query hash—not the 1536-d vector unless debugging index issues in staging.

LangSmith & Langfuse: traces, feedback, datasets

LangSmith (LangChain-native) and Langfuse (open-source, self-hostable) are the two most common LLM trace hubs. Both capture runs, attach feedback, link datasets to failures, and support prompt versioning—pick based on LangChain coupling and data residency.

Traces without feedback are archaeology. Production observability needs a place to store run trees, human thumbs, judge scores, and the golden rows those failures become. LangSmith and Langfuse both close that loop; they differ on deployment model, LangChain depth, and self-host flexibility.

LangSmith vs Langfuse

Dimension | LangSmith | Langfuse
Deployment | Cloud (LangChain) | Cloud or self-host (Docker/K8s)
LangChain integration | Automatic run trees | SDK decorators; works without LC
Runs & spans | Runs, child steps, playground replay | Traces, generations, sessions
Feedback | Human + API scores on runs | Scores linked to traces
Datasets | First-class; trace → example | Datasets + prompt registry
Best for | LangGraph monoliths, fast iteration | Regulated industries, multi-stack
Watch out | Vendor lock-in; trace volume cost | Self-host ops (Postgres + ClickHouse)

LangSmith

Deepest integration with LangChain: automatic run trees, playground from traces, datasets linked to failures. Strong for teams already on LangGraph. Cost scales with trace volume; use sampling in high-QPS chat.

Langfuse

Open-source trace UI with sessions, generations, scores, and prompt versions. Self-host for regulated industries; cloud for speed. Python/JS SDKs wrap manual spans without LangChain.

When to pick which

Situation | Pick | Why
LangChain / LangGraph monolith | LangSmith | Zero-config run trees + dataset export
Must self-host traces (HIPAA, EU) | Langfuse OSS | Data stays in your VPC
Polyglot stack (Java + Python) | Langfuse SDK | Framework-agnostic decorators
Need traces + CI eval in one UI | LangSmith | Datasets linked to failed runs
Cost/latency only, week-one bootstrap | Helicone (next section) | Proxy, not full trace hub

LangSmith trace + feedback (Python)
import os
from langsmith import Client
from langchain_core.tracers.context import collect_runs
from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "refund-copilot"

client = Client()
llm = ChatOpenAI(model="gpt-4o-mini")

# Runs auto-export when LANGCHAIN_TRACING_V2=true.
# collect_runs() captures the run ID; it is not available in response_metadata.
with collect_runs() as cb:
    result = llm.invoke("What is the refund window?")
    run_id = cb.traced_runs[0].id

# Attach human feedback to the run
client.create_feedback(
    run_id=run_id,
    key="helpfulness",
    score=0.85,
    comment="Correct policy, missing link",
)
// LangSmith REST — manual run log from non-LangChain Java service
HttpRequest req = HttpRequest.newBuilder()
    .uri(URI.create("https://api.smith.langchain.com/runs"))
    .header("x-api-key", langSmithKey)
    .POST(HttpRequest.BodyPublishers.ofString(runJson))
    .build();
// POST feedback: /feedback with run_id + score
Langfuse manual trace (Python SDK)
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context  # decorator API of the v2 Python SDK

langfuse = Langfuse()

@observe()
def answer_refund_question(request_id: str, query: str) -> str:
    langfuse_context.update_current_trace(metadata={"request_id": request_id})
    chunks = retrieve(query)  # nest @observe on retrieve too
    response = llm.complete(build_messages(chunks, query))
    langfuse_context.score_current_trace(name="helpfulness", value=0.9)
    return response
// Langfuse from Java: manual trace (illustrative; class and method names may not match the current Java client)
LangfuseClient client = LangfuseClient.create(publicKey, secretKey);
client.trace(new TraceBody("refund-" + requestId)
    .metadata(Map.of("feature", "refund_copilot")));
// Log generation with model, usage, output
📦 Real World

Support platform team exported LangSmith failed runs to JSONL nightly—same rows became CI gold in Guide 6 within one sprint.

💰 Cost

LangSmith pricing tracks trace volume, so sample 5–10% of prod chat and trace 100% of staging. Self-hosting Langfuse trades a license fee for roughly $800–2k/mo in infrastructure.

⚙️ Config

Set LANGCHAIN_PROJECT per feature (refund, billing, internal)—prevents trace UI noise and simplifies dataset export.

FAQ: LangSmith without LangChain?

Possible via REST/SDK manual runs, but awkward—pick Langfuse if not on LangChain.

Helicone: cost and latency proxy

Helicone sits as an OpenAI-compatible reverse proxy—swap base_url, get request logging, cost attribution, latency histograms, and caching headers in minutes. Complement (not replace) LangSmith/Langfuse when you need fast cost telemetry without full RAG span instrumentation.

Helicone sees every proxied completion: model, tokens, latency, status, and custom properties via headers. It cannot see retrieval or guardrail logic that runs before the proxy call—use it for cost/latency SLIs and budget alerts while SDK tracing captures the full waterfall.

Helicone capabilities

Feature | How | Use case
Request logging | Proxy all OpenAI traffic | Instant cost dashboard
Custom properties | Helicone-Property-* headers | Slice by tenant, feature, prompt_version
Response caching | Helicone-Cache-Enabled | Repeat FAQ queries; watch stale policy risk
Budget alerts | Helicone dashboard + webhooks | Page when daily spend > forecast
Latency tracking | Per-request TTFT + total | Provider incident detection

Helicone OpenAI proxy
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Property-Feature": "refund_copilot",
        "Helicone-Property-Tenant": tenant_id,
        "Helicone-Property-Prompt-Version": prompt_version,
    },
)
OpenAIClient client = OpenAIOkHttpClient.builder()
    .baseUrl("https://oai.helicone.ai/v1")
    .addHeader("Helicone-Auth", "Bearer " + heliconeKey)
    .addHeader("Helicone-Property-Feature", "refund_copilot")
    .addHeader("Helicone-Property-Tenant", tenantId)
    .addHeader("Helicone-Property-Prompt-Version", promptVersion)
    .apiKey(openAiKey)
    .build();

Caching headers

Enable Helicone cache for idempotent FAQ lookups—pair with prompt_version in cache key via property headers. Invalidate cache on policy deploy, not just model deploy.

Helicone response caching
headers = {
    "Helicone-Auth": f"Bearer {helicone_key}",
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Bucket-Max-Size": "1000",
    "Helicone-Property-Prompt-Version": "refund_v12",
}
# Only cache temperature=0, no tool calls, no user-specific context
.addHeader("Helicone-Cache-Enabled", "true")
.addHeader("Helicone-Cache-Bucket-Max-Size", "1000")
.addHeader("Helicone-Property-Prompt-Version", "refund_v12")
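
To make the invalidation rule concrete, here is a sketch of a cache key for an in-process FAQ cache. It illustrates the principle only and is not how Helicone derives its keys; `policy_version` is a hypothetical identifier for the deployed policy documents.

```python
import hashlib


def faq_cache_key(prompt_version: str, policy_version: str, model: str, query: str) -> str:
    """Build a cache key that changes whenever the prompt, policy, or model changes."""
    normalized_query = " ".join(query.split()).lower()
    # Unit separator keeps fields from running together ("a" + "bc" vs "ab" + "c").
    raw = "\x1f".join([prompt_version, policy_version, model, normalized_query])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because the versions are part of the key, a policy deploy makes old entries unreachable without an explicit purge; they simply age out under the cache's TTL.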

Budget alerts

Configure Helicone webhooks for daily spend thresholds; forward to PagerDuty for hard caps and Slack for soft warnings. Mirror the same thresholds in your internal cost_usd logs so Helicone and app metrics agree.
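
Mirroring the thresholds in-app can start with a single classification function shared by the webhook handler and the internal cost job. In this sketch the ratios are assumptions to tune: a soft warning when spend exceeds the forecast and a hard cap at 1.5× forecast.

```python
def classify_spend(
    spend_usd: float,
    forecast_usd: float,
    soft_ratio: float = 1.0,   # assumed: warn once spend exceeds forecast
    hard_ratio: float = 1.5,   # assumed: page at 150% of forecast
) -> str:
    """Map daily spend to 'ok', 'soft_warning' (Slack), or 'hard_cap' (PagerDuty)."""
    if forecast_usd <= 0:
        raise ValueError("forecast_usd must be positive")
    ratio = spend_usd / forecast_usd
    if ratio > hard_ratio:
        return "hard_cap"
    if ratio > soft_ratio:
        return "soft_warning"
    return "ok"
```

Feeding the same function with Helicone's reported spend and with the sum of internal cost_usd logs also gives a cheap consistency check: the two inputs should land in the same bucket.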

⚖️ Trade-off

Helicone proxy is fastest to deploy but blind to pre-proxy retrieval and guardrail spans. Use alongside Langfuse/LangSmith for full waterfall.

⚠️ Pitfall

Caching policy answers without prompt_version in cache key—users get stale refund rules after deploy.

💰 Cost

Helicone free tier covers early prod; paid tiers scale with logged requests. ROI is catching one runaway agent loop before finance notices.

Dashboards & alerts: SLIs and thresholds

Production dashboards track five core SLIs: refusal rate, faithfulness proxy, P95 latency, error rate, and cost per request—plus guardrail block rate from Guide 4. Wire alert thresholds in Grafana or Datadog with runbooks, not Slack noise.

SLIs for LLM apps mirror microservices (latency, errors, saturation) and add two semantic signals: quality and cost. Platform SRE owns TTFT and error budgets; product ML owns sampled judge scores and hallucination rates. Define SLOs per feature, not globally: a ticket summarizer can tolerate 8s p95 while a checkout copilot targets 2s TTFT p95.
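
Per-feature SLOs are easiest to enforce when they live in one table that dashboards and alert rules both read. A minimal sketch using the two example targets from this section; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LatencySlo:
    ttft_p95_ms: Optional[float] = None    # None means no TTFT target for this feature
    total_p95_ms: Optional[float] = None   # None means no total-latency target


FEATURE_SLOS = {
    "ticket_summarizer": LatencySlo(total_p95_ms=8000),
    "checkout_copilot": LatencySlo(ttft_p95_ms=2000),
}


def slo_violations(feature: str, ttft_p95_ms: float, total_p95_ms: float) -> list[str]:
    """Compare observed p95 values against the feature's own targets."""
    slo = FEATURE_SLOS[feature]
    violations = []
    if slo.ttft_p95_ms is not None and ttft_p95_ms > slo.ttft_p95_ms:
        violations.append("ttft_p95")
    if slo.total_p95_ms is not None and total_p95_ms > slo.total_p95_ms:
        violations.append("total_p95")
    return violations
```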

Core SLIs at a glance

SLI | Definition | Typical threshold | Source
Refusal rate | Guardrail + policy blocks / total requests | 2–8% baseline; alert if 2× spike | guardrail_decisions
Faithfulness proxy | Sampled NLI/judge pass rate | ≥88% weekly; alert if −8pts | Online eval sample
P95 latency | TTFT + total wall time | TTFT <800ms, total <4s | Helicone / app metrics
Error rate | 429, 5xx, overflow, empty | <1% 5xx; <5% 429 | Typed error_class
Cost / request | cost_usd p50/p99 | Alert if p99 3× baseline for 1h | Structured logs
Guardrail block rate | Blocks by dimension (Guide 4) | Injection spike → SecOps | dimensions_flagged

TTFT p95 and total latency

SLI | Definition | Typical SLO (chat) | Alert
TTFT p95 | First token − request start | < 800 ms | Page if >1.2 s for 15 min
Total p95 | Last token − request start | < 4 s | Ticket if >6 s for 1 h
Stream stall | Gap between chunks >2 s | < 0.5% requests | Warn
Tool round-trip | Per tool span | < 1.5 s p95 | Per dependency

Error rate (typed)

Split 429, context_length_exceeded, guardrail blocks, empty completions, and JSON parse failures—lumping into “5xx rate” hides fixable product bugs.

Error class | Owner | Mitigation
Rate limit 429 | Platform | Token bucket, queue, model fallback
Context overflow | App eng | Summarize thread, trim retrieval
Guardrail block | Safety | Expected; track false positive sample
Invalid JSON tool args | App eng | Repair prompt, tighten schema
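
A typed error_class has to be assigned somewhere, ideally in one function next to the provider call. The sketch below shows one possible mapping; the class names and the precedence order are assumptions, and `provider_code` is whatever error code string your vendor returns (for example context_length_exceeded).

```python
from typing import Optional


def classify_error(
    status_code: int,
    provider_code: Optional[str] = None,
    text: Optional[str] = None,
    guardrail_blocked: bool = False,
    tool_args_valid: bool = True,
) -> Optional[str]:
    """Return a typed error_class for one LLM call, or None if the call looks healthy."""
    if guardrail_blocked:
        return "guardrail_block"          # expected; owned by Safety
    if status_code == 429:
        return "rate_limit_429"           # owned by Platform
    if provider_code == "context_length_exceeded":
        return "context_overflow"         # owned by App eng
    if status_code >= 500:
        return "provider_5xx"
    if not tool_args_valid:
        return "invalid_tool_json"
    if text is None or not text.strip():
        return "empty_completion"         # HTTP 200 with nothing useful in it
    return None
```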

Token and cost per feature

Dashboard: sum(cost_usd) by feature / DAU, plus tokens_in p95 as early warning for prompt bloat. Anomaly detection on daily spend per tenant catches infinite agent loops before finance does.

Hallucination sampling

You cannot run NLI faithfulness on 100% of prod traffic. Sample 1–5% stratified by locale and intent; run lightweight judge (gpt-4o-mini with retrieved-only context); store scores in trace UI. Trend weekly; page on 10pt drop vs rolling baseline.
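
Stratified sampling can be made deterministic by hashing the request ID, so a retried request is either always or never scored. A minimal sketch; the per-stratum rates are hypothetical and would normally come from config.

```python
import hashlib

DEFAULT_RATE = 0.03
# Hypothetical overrides: oversample low-volume strata so they still get enough scores.
STRATUM_RATES = {("de-DE", "refund"): 0.05}


def should_sample(
    request_id: str,
    locale: str,
    intent: str,
    rates: dict = STRATUM_RATES,
    default_rate: float = DEFAULT_RATE,
) -> bool:
    """Decide whether to shadow-score a request, using its stratum's sample rate."""
    rate = rates.get((locale, intent), default_rate)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Store the stratum and the rate used alongside each score, because weekly trends need to reweight oversampled strata back to their true traffic share.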

Prometheus-style metrics emission
from prometheus_client import Histogram, Counter

TTFT = Histogram("llm_ttft_seconds", "Time to first token", ["model", "feature"])
ERRORS = Counter("llm_errors_total", "LLM errors", ["error_class", "feature"])
COST = Counter("llm_cost_usd_total", "Estimated USD", ["feature", "tenant"])

def observe_call(meta: dict):
    TTFT.labels(meta["model"], meta["feature"]).observe(meta["latency_ms"]["ttft"] / 1000)
    COST.labels(meta["feature"], meta["tenant_id"]).inc(meta["cost_usd"])
    if meta.get("error_class"):
        ERRORS.labels(meta["error_class"], meta["feature"]).inc()
meterRegistry.summary("llm.ttft.seconds", "model", model, "feature", feature)
    .record(ttftMs / 1000.0);
meterRegistry.counter("llm.cost.usd.total", "feature", feature, "tenant", tenantId)
    .increment(costUsd);
Online hallucination sampler
import random
from faithfulness import nli_entailment  # your wrapper

SAMPLE_RATE = 0.03

def maybe_score_faithfulness(request_id: str, answer: str, chunks: list[str]) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = nli_entailment(answer, chunks)  # 0..1
    traces.log_score(request_id, name="faithfulness", value=score)
    metrics.gauge("llm.faithfulness.sample", score, tags=[f"req:{request_id[:8]}"])
public void maybeScoreFaithfulness(String requestId, String answer, List chunks) {
  if (ThreadLocalRandom.current().nextDouble() > SAMPLE_RATE) return;
  double score = faithfulnessClient.nliEntailment(answer, chunks);
  traces.logScore(requestId, "faithfulness", score);
}
SLO burn rate alert (TTFT)
# Pseudocode for multi-window burn alert
# SLO: 95% of requests have TTFT < 800ms over 30d, so the error budget is 5%
# Fast burn: 14.4x budget rate sustained for 1h (about 2% of the 30d budget) → page

SLO_MS = 800
ERROR_BUDGET = 0.05  # allowed fraction of requests slower than SLO_MS

def ttft_burn_rate(window_minutes: int) -> float:
    # fraction_above is a hypothetical helper: share of requests in the window over the threshold
    slow_fraction = metrics.fraction_above("llm.ttft_ms", SLO_MS, window=f"{window_minutes}m")
    return slow_fraction / ERROR_BUDGET

if ttft_burn_rate(60) > 14.4:
    pager.trigger("LLM TTFT SLO fast burn")
double ttftBurnRate(int windowMinutes) {
  // Burn rate = observed slow-request fraction / allowed fraction (5% budget)
  double slowFraction = metrics.fractionAbove("llm.ttft_ms", 800, windowMinutes);
  return slowFraction / 0.05;
}
💡 Tip

Plot tokens_out vs total_latency_ms—linear correlation confirms decode-bound slowness; flat line implicates network or queue.

⚠️ Pitfall

Averaging daily faithfulness scores without volume weighting lets a 0.2 score from a 2-sample day move the weekly chart as much as a 10k-sample day.
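
The difference is easy to see with the two numbers from this pitfall. A small sketch, assuming each day is summarized as a `(mean_score, n_samples)` pair:

```python
from typing import Optional


def naive_mean(daily: list[tuple[float, int]]) -> float:
    """Average of daily means; every day counts equally regardless of volume."""
    return sum(score for score, _ in daily) / len(daily)


def weighted_mean(daily: list[tuple[float, int]]) -> Optional[float]:
    """Average weighted by sample count; returns None when there are no samples."""
    total = sum(n for _, n in daily)
    if total == 0:
        return None
    return sum(score * n for score, n in daily) / total


days = [(0.9, 10_000), (0.2, 2)]
print(naive_mean(days))     # dragged to the midpoint of 0.9 and 0.2
print(weighted_mean(days))  # stays close to 0.9
```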

🔬 Under the Hood

Histogram buckets for TTFT should start at 50ms, not 1s. TTFT distributions often concentrate below 200ms, with a long tail or secondary bump from queueing, and coarse buckets hide both.

Dashboard row (minimum)

  • TTFT p50/p95 by model and feature
  • Error rate by error_class
  • Daily cost_usd by tenant (top 10)
  • Faithfulness sample 7-day rolling avg
  • Guardrail block rate (from Guide 4)

Grafana / Datadog sketch

Export Prometheus metrics from your log pipeline or OTel collector. Typical dashboard rows: TTFT heatmap by model, refusal rate stacked by guardrail dimension, faithfulness 7-day rolling avg, cost/tenant top-N, error class breakdown. Datadog: use gen_ai.* facets if using OpenLLMetry; correlate with APM service map.

⚙️ Config

Link guardrail block-rate panel to Guide 4 dimensions—injection spikes route to SecOps; PII blocks route to privacy oncall. One “block rate” metric hides which rail failed.

Alert thresholds

LLM alerts fall into four buckets: errors (hard failures), latency (SLO burn), cost anomalies (spend without traffic), and quality (judge score drift, escalation spikes). Each needs a runbook link, an owner, and explicit “expected noise” for guardrail blocks during attacks.

Error alerts

Condition | Severity | Runbook action
429 rate >5% for 5m | Page | Enable queue; switch fallback model
5xx from provider >1% for 5m | Page | Status page; regional failover
Context overflow >10% for 15m | Ticket | Trim retrieval; thread summarization
Empty completion >0.5% | Warn | Check filter/harm blocks misfiring

Latency alerts

Use multi-window burn rates on TTFT and total latency. Pair them with tokens_out p95: if a latency alert fires without a token spike, suspect a provider incident, not your deploy.
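A minimal sketch of a two-window check. Note that it uses the request-ratio form of burn rate (share of requests over the latency threshold, divided by the allowed share), which is a different formulation from the p95-based helper shown earlier in this guide. The 95% target and the `(slow, total)` tuple shape are assumptions; the 14.4 threshold matches the fast-burn value used above.

```python
def burn_rate(slow_requests: int, total_requests: int, slo_target: float) -> float:
    """Share of requests over the latency threshold, divided by the allowed share."""
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.05 for a 95% target
    return (slow_requests / total_requests) / error_budget

def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo_target: float = 0.95, threshold: float = 14.4) -> bool:
    """Page only when both the long (e.g. 1h) and short (e.g. 5m) windows burn fast."""
    return (burn_rate(*long_window, slo_target) > threshold
            and burn_rate(*short_window, slo_target) > threshold)
```

Requiring the short window as well keeps the page from firing on an incident that has already recovered.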

Cost anomaly alerts

Forecast daily spend with a 7-day seasonal baseline; alert when actual spend exceeds the forecast by more than 2σ or is above 120% of forecast. Slice by tenant_id to catch one customer’s agent loop, not fleet-wide noise.

Signal | Threshold | Likely cause
Cost/req p99 3× baseline | 1 hour | Runaway tool loop, missing max_tokens
Embedding cost 2× | 24 h | Re-embed cron duplicated
Judge token line item 5× | 24 h | Sample rate config typo (0.3 vs 0.03)
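The seasonal baseline and per-tenant slicing described in this section could look roughly like the sketch below. It assumes daily spend history per tenant, oldest first and ending yesterday; the function names and the 14-day minimum are illustrative assumptions.

```python
from statistics import mean, pstdev

def seasonal_forecast(history: list[float], period: int = 7) -> tuple[float, float]:
    """Forecast today from the same weekday in prior weeks.

    history is daily spend, oldest first, ending yesterday.
    Returns (forecast, sigma) over the same-weekday samples.
    """
    same_weekday = history[-period::-period]  # 7, 14, 21, ... days ago
    return mean(same_weekday), pstdev(same_weekday)

def cost_anomalies(today_by_tenant: dict[str, float],
                   history_by_tenant: dict[str, list[float]]) -> list[str]:
    """Flag tenants whose spend is above forecast + 2 sigma or above 120% of forecast."""
    flagged = []
    for tenant, today in today_by_tenant.items():
        history = history_by_tenant.get(tenant, [])
        if len(history) < 14:  # need at least two same-weekday points
            continue
        forecast, sigma = seasonal_forecast(history)
        if today > forecast + 2 * sigma or today > 1.2 * forecast:
            flagged.append(tenant)
    return flagged
```

A tenant with perfectly flat history has sigma 0, so any increase trips the 2σ branch; a small sigma floor is a reasonable addition if that proves noisy.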

Quality alerts

Quality signals lag—use 24h windows. Alert on faithfulness sample drop >8pts vs 7d baseline, CSAT drop correlated with model version tag, or support escalation keyword rate spike (“wrong policy”, “made up”).
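The escalation keyword signal is cheap to compute from ticket or chat text. A sketch, using the two phrases quoted above; the 2× spike ratio and the helper names are assumptions.

```python
ESCALATION_PHRASES = ("wrong policy", "made up")  # from the text above; extend per product

def escalation_rate(messages: list[str]) -> float:
    """Share of messages containing at least one escalation phrase (case-insensitive)."""
    if not messages:
        return 0.0
    hits = sum(any(p in m.lower() for p in ESCALATION_PHRASES) for m in messages)
    return hits / len(messages)

def is_spike(today_rate: float, baseline_rate: float, ratio: float = 2.0) -> bool:
    """Flag when today's rate is at least `ratio` times the baseline (assumed default)."""
    return baseline_rate > 0 and today_rate >= ratio * baseline_rate
```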

Alertmanager-style routing by severity
ALERTS = [
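    # Ratio of 429s to all requests, not errors/sec; llm_requests is an assumed counter name.
    {"name": "llm_429_spike", "expr": "sum(rate(llm_errors{class='429'}[5m])) / sum(rate(llm_requests[5m])) > 0.05", "severity": "page"},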
    {"name": "llm_cost_anomaly", "expr": "daily_cost > forecast * 1.2", "severity": "ticket"},
    {"name": "faithfulness_drop", "expr": "faithfulness_7d_avg < baseline - 0.08", "severity": "ticket"},
]

def route(alert: dict):
    if alert["severity"] == "page":
        pagerduty.trigger(alert["name"], runbook="wiki/llm-incident")
    else:
        slack.post("#llm-ops", alert["name"])
public void route(Alert alert) {
  if ("page".equals(alert.severity())) {
    pagerduty.trigger(alert.name(), "wiki/llm-incident");
  } else {
    slack.post("#llm-ops", alert.name());
  }
}
Cost anomaly (simple Z-score)
import statistics as stats

history = load_daily_cost(feature="refund_copilot", days=28)
today = current_daily_cost()

mu = stats.mean(history)
sigma = stats.pstdev(history) or 1.0
z = (today - mu) / sigma

if z > 2.5:
    fire_alert("cost_anomaly", {"z": z, "today_usd": today, "mean_usd": mu})
double mu = history.stream().mapToDouble(d -> d).average().orElse(0);
double sigma = stdDev(history);
double z = (today - mu) / Math.max(sigma, 1e-6);
if (z > 2.5) fireAlert("cost_anomaly", Map.of("z", z));
Quality alert tied to prompt_version
def check_quality_regression():
    for version in active_prompt_versions():
        score = metrics.avg("faithfulness", filter={"prompt_version": version}, window="24h")
        baseline = baselines.get(version)
        if score is None or baseline is None:
            continue  # new version or no samples yet: nothing to compare against
        if score < baseline - 0.08:  # 8-point drop on a 0-1 scale
            create_incident(
                title=f"Quality drop on {version}",
                owner="ml-oncall",
                links=[f"langfuse://prompts/{version}"],
            )
void checkQualityRegression() {
  for (String version : activePromptVersions()) {
    double score = metrics.avg("faithfulness", Map.of("prompt_version", version), "24h");
    Double baseline = baselines.get(version);
    if (baseline == null) continue;  // no stored baseline for this version yet
    if (score < baseline - 0.08) createIncident(version, score);
  }
}
🔒 Security

Wire guardrail block spikes to SecOps when dimension includes injection or PII—not all blocks belong in #llm-ops noise.

📦 Real World

Cost anomaly caught a deploy that set max_tool_iterations=50 default—support bot burned $12k overnight calling search in a loop.

🎯 Interview Tip

"What do you alert on for LLM prod?" — Typed errors, TTFT SLO burn, cost/req anomalies, sampled quality drift—not BLEU on live traffic.

Runbook template

  1. Acknowledge; check provider status and recent deploys
  2. Compare TTFT vs tokens_out; open trace exemplars
  3. Roll back prompt_version or enable fallback model
  4. Post-mortem: add failing trace to golden set (Guide 6 CI)

Online evals: sample, score, promote to gold

Run online evals on 1–5% of production traffic: shadow scoring with rules and lightweight judges, a human review queue for borderline cases, automatic promotion of failures to the golden dataset, and week-over-week drift detection.

Offline CI cannot cover every user phrasing, locale, or adversarial pattern. Online eval closes the gap without bankrupting inference—sample stratified traffic, score asynchronously, and feed failures back to gold within days.

Sample 1–5% production traffic

Strategy | Rate | When
Head-based random | 1–3% | Steady-state chat
Stratified by locale/intent | 5% on thin slices | Bias audit, new market launch
100% on error/guardrail block | All failures | Always (cheap signal)
100% staging | All | Pre-prod prompt experiments
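The strategies in the table can be combined into one decision function. The sketch below hashes the request ID so the same request always gets the same decision; the stratum names, the rates in `SAMPLE_RATES`, and the function names are assumptions.

```python
import hashlib

# Assumed per-stratum rates; "default" covers steady-state chat.
SAMPLE_RATES = {"default": 0.03, "thin_slice": 0.05}

def sample_fraction(request_id: str) -> float:
    """Map a request ID to a stable value between 0 and 1 so retries sample identically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def should_sample(request_id: str, stratum: str = "default",
                  failed: bool = False, staging: bool = False) -> bool:
    """Always keep failures and staging traffic; otherwise sample by stratum rate."""
    if failed or staging:
        return True
    rate = SAMPLE_RATES.get(stratum, SAMPLE_RATES["default"])
    return sample_fraction(request_id) < rate
```

Hash-based sampling is a design choice rather than a requirement: it makes the decision reproducible across services, which `random.random()` does not.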

Shadow scoring with rules

Shadow path runs after the user sees the response—never block on online judge latency. Apply cheap rules first: regex policy phrases, citation ID match against retrieval_ids, JSON schema re-validation, max length. Escalate to LLM-as-judge only when rules are inconclusive.
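One possible shape for the `run_rule_checks` helper that the scorer in this section calls. Treat it as a sketch: the `output` and `expects_json` fields, the `[doc:ID]` citation format, the banned phrases, and the length limit are all assumptions, and a plain JSON parse stands in for full schema validation.

```python
import json
import re

BANNED = re.compile(r"\b(guaranteed refund|legal advice)\b", re.IGNORECASE)  # assumed phrases
CITATION = re.compile(r"\[doc:([\w-]+)\]")  # assumed citation format
MAX_CHARS = 4000                            # assumed length limit

def run_rule_checks(log: dict) -> dict:
    """Return {"passed": True | False | None, "reasons": [...]}; None means inconclusive."""
    output = log.get("output", "")
    reasons = []
    if BANNED.search(output):
        reasons.append("banned_phrase")
    if len(output) > MAX_CHARS:
        reasons.append("too_long")
    cited = set(CITATION.findall(output))
    if cited - set(log.get("retrieval_ids", [])):
        reasons.append("citation_not_retrieved")
    if log.get("expects_json"):
        try:
            json.loads(output)  # parse check only, not schema validation
        except json.JSONDecodeError:
            reasons.append("invalid_json")
    if reasons:
        return {"passed": False, "reasons": reasons}
    # No citations to verify: rules cannot vouch for faithfulness, so defer to the judge.
    return {"passed": True if cited else None, "reasons": []}
```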

Async shadow scorer (sampled)
import random
from queue import Queue

shadow_queue: Queue = Queue()

def maybe_enqueue_shadow(log: dict) -> None:
    if log.get("guardrail_decisions") and any(d["action"] == "block" for d in log["guardrail_decisions"]):
        shadow_queue.put(log)  # 100% on blocks
        return
    if random.random() < 0.03:  # 3% sample
        shadow_queue.put(log)

def shadow_worker():
    while True:
        log = shadow_queue.get()
        rules = run_rule_checks(log)  # citation match, banned phrases
        if rules["passed"] is None:  # inconclusive
            rules["judge_score"] = llm_judge(log)
        store_online_eval_result(log["request_id"], rules)
void maybeEnqueueShadow(LlmCallLog log) {
  boolean blocked = log.guardrailDecisions().stream()
      .anyMatch(d -> "block".equals(d.get("action")));
  if (blocked || ThreadLocalRandom.current().nextDouble() < 0.03) {
    shadowQueue.offer(log);
  }
}

Human review queue

Route low-confidence shadow scores (judge 0.4–0.6, rule conflicts, user thumbs-down) to a review UI in LangSmith or Langfuse. Reviewers label pass/fail with rubric notes; one click promotes to golden JSONL with source=production and trace_id link.
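The routing rule can be expressed as a single predicate. In this sketch, "rule conflict" is interpreted as the rules and the judge disagreeing, which is one reading among several; the `user_feedback` field and the 0.5 pass line are assumptions.

```python
def needs_human_review(result: dict) -> bool:
    """Route borderline or contested online-eval results to the review queue."""
    score = result.get("judge_score")
    passed = result.get("passed")
    borderline = score is not None and 0.4 <= score <= 0.6
    # Assumed meaning of "rule conflict": rules and judge land on opposite sides.
    conflict = (score is not None and passed is not None
                and passed != (score >= 0.5))
    thumbs_down = result.get("user_feedback") == "thumbs_down"
    return borderline or conflict or thumbs_down
```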

Promote failures to golden dataset

  1. Shadow score fails or human marks fail
  2. Export row: input, expected policy, actual output, retrieval_ids, prompt_version
  3. PR to evals/gold/ with reviewer sign-off
  4. CI runs on merge—failure becomes regression gate in Guide 6
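Step 2 of the list above might serialize a row like this. The field names mirror the export list and the `source=production` tag mentioned earlier; the `input`, `output`, and `trace_id` log keys and the function name are assumptions.

```python
import json

def to_gold_row(log: dict, expected_policy: str) -> str:
    """Serialize one failed production case as a single JSONL line for evals/gold/."""
    row = {
        "input": log["input"],
        "expected_policy": expected_policy,
        "actual_output": log["output"],
        "retrieval_ids": log.get("retrieval_ids", []),
        "prompt_version": log.get("prompt_version"),
        "source": "production",
        "trace_id": log.get("trace_id"),
    }
    # sort_keys keeps diffs stable in the review PR.
    return json.dumps(row, ensure_ascii=False, sort_keys=True)
```

Run the same PII redaction you apply to logs before a row leaves production; golden files tend to live longer than traces.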

Drift detection week-over-week

Compare weekly aggregates: faithfulness sample mean, refusal rate by dimension, embedding centroid similarity of answers, CSAT on bot-handled tickets. Alert when any metric moves >2σ vs four-week baseline—not on single-day noise.

Drift signal | Window | Action
Faithfulness −8pts | 7d vs prior 7d | Freeze prompt merges; open incident
Refusal rate 2× | 24h | Check attack traffic vs false positives
Answer embedding centroid | Weekly | Index rot or model swap investigation
Escalation keyword rate | 7d | Product review; add gold rows
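The ">2σ vs four-week baseline" rule is a few lines of arithmetic. A sketch, assuming one weekly aggregate per metric; the function name and the four-week minimum check are illustrative.

```python
from statistics import mean, pstdev

def drifted(current: float, baseline_weeks: list[float], sigmas: float = 2.0) -> bool:
    """True when this week's aggregate sits more than `sigmas` std devs from the baseline mean."""
    if len(baseline_weeks) < 4:
        return False  # not enough history to judge
    mu, sd = mean(baseline_weeks), pstdev(baseline_weeks)
    if sd == 0:
        return current != mu
    return abs(current - mu) > sigmas * sd
```

Run it per metric and per slice (prompt_version, dimension) so one drifting slice is not averaged away by the rest.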
💡 Tip

Tag every online eval row with prompt_version and model_resolved—drill-down without guessing which deploy caused drift.

⚠️ Pitfall

100% online LLM-as-judge in prod—doubles inference cost and adds judge bias. Rules first, judge on sample, human on borderline.

🎯 Interview Tip

"Online vs offline eval?" — Offline blocks bad merges; online detects drift on live traffic; failures promote to gold; closed loop in days not quarters.

🔬 Under the Hood

Shadow queue should be durable (SQS/Kafka)—dropping samples during traffic spikes hides the exact failures you need most.

Track 5 observability production checklist

Ship when structured logs, traces, dashboards, and alerts pass the Track 5 observability checklist—then connect exemplar failures to CI eval in the next guide.

Guide 5 closes when traces, dashboards, and alerts match the Track 5 observability checklist below—then wire regression gates in Guide 6 CI. Observability without CI regression is reactive; CI without production telemetry is blind to index rot and provider drift.

Track 5 observability checklist

Item | Done when | Owner
Canonical LlmCallLog schema in git | Code review enforced | Platform
Trace tool chosen + OTel export | Staging 100% sampled | Platform
TTFT + total latency dashboards | Per feature SLO lines drawn | SRE
Cost per tenant/feature | Daily chart + anomaly alert | FinOps + ML
Faithfulness 3% sample | 7d baseline stored | ML
Guardrail fields joined on trace | From Guide 4 checklist | Safety
Incident runbook linked in alerts | On-call drill completed | SRE
PII redaction verified | Sec review sign-off | Security

Pre-CI handoff fields

Export these to your eval CI pipeline (next guide):

  • prompt_version, model_resolved
  • faithfulness.sample weekly avg
  • cost_usd per request for budget regression tests
  • Exemplar trace_id links on failed judge scores
Observability readiness gate (CI script)
CHECKS = {
    # Log schema tests must pass before anything ships.
    "schema_tests_pass": run_pytest("tests/test_llm_log_schema.py"),
    # Prod head sampling stays at or below 10%; staging runs at 100%.
    "prod_sample_rate": env_float("PROD_TRACE_SAMPLE") <= 0.1,
    "dashboard_exists": grafana.api.dashboard_uid("llm-ttft") is not None,
    "alert_linked": pagerduty.has_runbook("llm_429_spike"),
}

def observability_ready() -> bool:
    failed = [k for k, v in CHECKS.items() if not v]
    if failed:
        print("Blocked:", failed)
        return False
    return True
Map<String, Supplier<Boolean>> checks = Map.of(
    "schema_tests_pass", () -> pytest.run("tests/LlmLogSchemaTest"),
    "prod_sample_rate", () -> prodTraceSample <= 0.1f
);
boolean observabilityReady() {
  return checks.entrySet().stream().allMatch(e -> e.getValue().get());
}
Weekly observability review agenda
AGENDA = [
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend",
    "New error classes this week",
    "Traces promoted to golden set",
]
# 30min ML + SRE + product — outputs tickets, not slides
List<String> AGENDA = List.of(
    "TTFT p95 vs SLO by feature",
    "Top 5 cost tenants + anomalies",
    "Faithfulness sample trend"
);
📦 End of Guide 5

You can explain LLM-specific observability gaps, log a canonical schema, pick tracing tooling, define SLIs, and alert with runbooks. Guide 6 adds CI evaluation—regression gates that block bad prompt merges before they reach the traffic you just instrumented.

🔒 Security

Checklist: verify prod logs contain zero raw PAN/SSN samples in weekly automated scan; alert if regex hits increase.
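A weekly scan can start as a pair of regexes plus a Luhn check to cut false positives on long digit runs. This is a sketch, not a complete detector: the patterns cover only US-style SSNs and 13 to 19 digit card numbers, and the function names are assumptions.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PAN_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def count_pii_hits(lines: list[str]) -> dict[str, int]:
    """Count log lines containing SSN-shaped or Luhn-valid PAN-shaped strings."""
    hits = {"ssn": 0, "pan": 0}
    for line in lines:
        if SSN_RE.search(line):
            hits["ssn"] += 1
        for match in PAN_RE.finditer(line):
            digits = re.sub(r"\D", "", match.group())
            if 13 <= len(digits) <= 19 and luhn_valid(digits):
                hits["pan"] += 1
                break  # count each line once
    return hits
```

Store the weekly counts and alert on any increase; the target is zero, so even one hit deserves a look at the redaction path.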

📦 Real World

Teams that complete this checklist before CI eval report 40% faster root-cause on prompt regressions—trace exemplars become golden rows automatically.

🎯 Interview Tip

"Observability stack for LLMs?" — Schema per call, trace waterfall, TTFT/cost SLIs, sampled judges, anomaly alerts, closed loop to eval datasets.

💰 Cost

Budget 1–3% of inference spend for observability SaaS + storage—cheaper than one undetected runaway agent weekend.

🔬 Under the Hood

Trace retention 7–14d hot, 90d cold object storage—long enough for weekly quality review, not a compliance lake by accident.

⚖️ Trade-off

100% tracing gives perfect forensics but doubles cost at scale—10% head-based sampling plus 100% on errors is the usual compromise.

⚠️ Pitfall

Shipping dashboards without alert runbooks—on-call acks pages then closes them; drift continues until CSAT collapses.

Track 5 guide sequence

Guide | Topic | You are here
Guide 4 | Safety & guardrails | ← prev
Guide 5 | Observability & monitoring |
Guide 6 | Continuous evaluation in CI | next →

FAQ: Sample rate for traces?

100% staging; prod 5–10% head sample plus 100% on error and guardrail block. Increase temporarily during launches.

FAQ: Who owns quality alerts?

ML oncall triages; product approves prompt rollback; platform owns latency/error pages.