Fine-tuning & adaptation

When to fine-tune: prompt vs RAG vs fine-tune

Prompting, RAG, and routing change inputs and orchestration at query time. Fine-tuning changes the model’s default behavior by updating weights. The wrong choice wastes GPU budget, creates stale knowledge, and blocks fast policy updates. Start with the cheapest path that can pass your eval suite, not the path that sounds most impressive.

Most production LLM features need three capabilities: correct facts, consistent format, and appropriate tone. These map to different adaptation layers. Facts that change weekly belong in a retrieval index, not in memorized weights. Stable JSON schemas and classification boundaries are fine-tuning sweet spots when prompt engineering plateaus on your golden set.

Four adaptation paths

Path	What moves	Best when	Weak when
Prompt engineering	System instructions, few-shot examples, output schema	Tone, refusals, structured output, short policy summaries	Long factual corpora stuffed into context; brittle format at scale
RAG	Retrieved chunks injected at query time	Policies, docs, catalogs that change often; cite-able answers	Needle-in-haystack without good index; latency-sensitive micro-classifiers
Routing + tools	Model tier, agent loop, external APIs	Structured workflows (refunds, lookups, calculations)	Pure free-text FAQ where every answer is generative
Fine-tuning	Model weights (full or adapter)	Stable style, intent classification, extraction schemas, domain phrasing	Facts change weekly; insufficient labeled data; prompt+RAG already pass eval

Decision matrix

Score each incoming use case against these signals. If policy or catalog freshness is critical, RAG wins regardless of team preference. If the task is “always return this JSON shape for support tickets,” try structured output + prompt first; fine-tune only when eval shows persistent format drift across hundreds of real examples.

Signal	Prefer	Avoid fine-tune because
Answer must cite source documents	RAG + citation prompt	FT cannot guarantee retrieval; weights hallucinate confidently
Policy updates > once/month	RAG	Retrain lag; stale weights until next job completes
Fixed output schema (100+ fields)	Structured output → then FT if needed	FT helps only after native schema mode exhausted
Intent routing (5–20 labels)	Mini model + prompt → FT on mini	Frontier FT is expensive for simple classification
Brand voice stable 12+ months	Prompt → FT on style rows	Prompt alone may work; FT is optimization not unblocker
Multilingual phrasing in one domain	FT or adapter on open model	Prompts balloon per locale

⚠️ Pitfall

Never fine-tune for knowledge alone. If the model must know this quarter’s refund policy, product SKUs, or internal org chart, that information belongs in RAG or a database tool—not in weights. Fine-tuned “facts” go stale silently; users get confident wrong answers with no audit trail. RAG gives you chunk IDs to log; fine-tuning gives you regret.

💡 Pro Tip

Run the same golden eval through prompt-only, RAG, and (if justified) a fine-tuned endpoint before committing to FT. If prompt + RAG already meet your threshold, fine-tuning is a cost/latency optimization—not a project unblocker. Document the comparison in the PR that approves GPU spend.

Decision flowchart

flowchart TD
  A[New LLM feature request] --> B{Does answer require citing live documents?}
  B -->|Yes| C[Use RAG]
  B -->|No| D{Is behavior stable 12+ months?}
  D -->|No| E[Prompt + tools + routing]
  D -->|Yes| F{Prompt + structured output passes eval?}
  F -->|Yes| G[Ship prompt path — no FT]
  F -->|No| H{Have 500+ quality labeled examples?}
  H -->|No| I[Collect data — improve prompt/RAG first]
  H -->|Yes| J{Before/after eval shows FT lift?}
  J -->|No| K[Do not fine-tune]
  J -->|Yes| L[Fine-tune with CI eval gate]
  C --> M[Track chunk freshness SLA]
  L --> N[Register model ID in gateway + flags]
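
The flowchart above can also be written as a small function, which makes the decision easy to unit-test and to attach to a decision doc. This is a minimal sketch: the `FeatureRequest` fields and the returned labels are hypothetical names chosen for illustration, and the 500-example threshold is taken from the flowchart.

```python
from dataclasses import dataclass


@dataclass
class FeatureRequest:
    """Answers to the flowchart questions for one feature (hypothetical shape)."""
    needs_live_citations: bool
    stable_12_months: bool
    prompt_passes_eval: bool
    labeled_examples: int
    ft_lift_confirmed: bool


def choose_path(req: FeatureRequest, min_examples: int = 500) -> str:
    """Walk the flowchart top to bottom and return the first matching path."""
    if req.needs_live_citations:
        return "rag"
    if not req.stable_12_months:
        return "prompt_tools_routing"
    if req.prompt_passes_eval:
        return "prompt_only"
    if req.labeled_examples < min_examples:
        return "collect_data"
    if not req.ft_lift_confirmed:
        return "do_not_fine_tune"
    return "fine_tune_with_ci_gate"


if __name__ == "__main__":
    request = FeatureRequest(
        needs_live_citations=False,
        stable_12_months=True,
        prompt_passes_eval=False,
        labeled_examples=820,
        ft_lift_confirmed=True,
    )
    print(choose_path(request))
```

Because the checks run in flowchart order, a request that needs live citations returns the RAG path regardless of how much labeled data exists.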

📦 Real World

Support teams often request fine-tuning after 50 “bad tone” examples. In practice, a revised system prompt + 8 curated few-shot pairs fixes 80% of tone issues in a day. Reserve fine-tuning for classification F1 stuck below target after prompt iteration, or extraction JSON that fails schema validation on >5% of production traffic despite response_format.

🎯 Interview Tip

When asked “RAG or fine-tune?” answer with the freshness axis: RAG for mutable facts, fine-tune for stable behavior patterns. Mention eval gates—never ship FT without base vs fine-tuned comparison on the same golden set. Interviewers listen for “we fine-tuned the retriever embedding model” vs “we fine-tuned GPT-4 for policy”—different problems, different data.

⚖️ Trade-off

Prompts and RAG update in minutes; fine-tuning updates in hours to days plus re-eval cycles. The operational cost of FT is not the training bill; it is the retrain, re-eval, and registry update required every time behavior must change. Prefer adaptation paths your on-call can roll back with a flag flip.

Prerequisite context: RAG explained, system prompts, and LLM gateway for routing fine-tuned model IDs alongside base models.

Approaches: full FT, LoRA, QLoRA, instruction tuning

“Fine-tuning” spans everything from updating every weight in a 70B model to training a 16MB LoRA adapter on a laptop. Hosted APIs (OpenAI, Bedrock) hide the technique; self-hosted open models expose the full menu. Pick based on model access, GPU budget, and whether you need mergeable artifacts for air-gapped deploy.

Comparison table

Approach	What updates	VRAM / cost	Typical use	Production notes
Full fine-tuning	All model parameters	Very high (multi-GPU, 70B+ impractical without cluster)	Domain shift on open models; research; small models (<7B)	Produces full checkpoint; hard to A/B vs base without duplicate serving
LoRA	Low-rank adapter matrices on attention/MLP layers	Moderate (frozen base still sits in memory; gradients and optimizer state only for adapter weights)	Instruction following, style, classification on 7B–70B	Swap adapters at runtime; multiple tasks on one base
QLoRA	LoRA on 4-bit quantized base weights	Low (consumer GPU for 7B–13B)	Prototyping, internal tools, cost-sensitive open-model FT	Base stays quantized during training; validate perplexity vs bf16 base before prod
Instruction tuning	SFT on (instruction, response) pairs—full or LoRA	Depends on base technique	Chat alignment, tool-call format, JSON extraction	OpenAI hosted FT is instruction-style on chat messages JSONL

Hosted vs self-hosted

OpenAI fine-tuning is managed instruction tuning: you upload JSONL, they train, and you receive ft:gpt-4o-mini:org:... model IDs. You do not choose the LoRA rank or the training method on the public API; the exposed hyperparameters are limited to a few knobs such as n_epochs, batch_size, and learning_rate_multiplier. AWS Bedrock offers model customization jobs (fine-tuning and continued pre-training on supported base models) with S3-stored output artifacts and provisioned throughput options; Custom Model Import is a separate Bedrock feature for bringing in weights trained elsewhere. Self-hosted (Hugging Face PEFT, Axolotl, LLaMA-Factory) gives full control over LoRA/QLoRA hyperparameters and merge-to-single-checkpoint for vLLM or TGI serving.

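from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model; the ID is an example (same base as the Java manifest below)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                    # rank: higher = more capacity, more VRAM
    lora_alpha=32,           # scaling numerator; the adapter update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirm only a small fraction is trainable
# Train only ~0.1–1% of parameters; save adapter dir, not full 7B weights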

// Self-hosted LoRA typically runs via Python training jobs.
// Java serving layer loads base + adapter metadata from model registry:

public record AdapterManifest(
    String baseModelId,      // e.g. meta-llama/Llama-3.1-8B
    String adapterPath,      // s3://models/support-intent-lora-v3
    int rank,
    String taskType
) {}

// Gateway resolves: tenant → adapter manifest → inference endpoint

🔬 Under the Hood

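LoRA injects trainable low-rank matrices ΔW = BA alongside frozen weights W, where B and A share a small inner dimension r. At inference, the effective weight is W + (α/r)·BA, with α the lora_alpha scaling parameter; the product can be merged into W once to avoid extra latency. QLoRA keeps W in 4-bit NF4 while the adapters stay in fp16/bf16, which is why a 24GB GPU can fine-tune 13B models that full FT would exclude.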
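
A little arithmetic makes the parameter and memory savings concrete. The sketch below uses only the standard library; the 4096×4096 layer shape and the α/r scaling convention are assumptions for illustration (the scaling matches how PEFT-style `lora_alpha` is conventionally applied), and the memory figure covers weights only, not activations, gradients, or optimizer state.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone at a given precision."""
    return n_params * bits_per_param / 8 / 1e9


def matmul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product, fine for tiny demo matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(W, B, A, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


if __name__ == "__main__":
    full = 4096 * 4096
    adapter = lora_param_count(4096, 4096, r=16)
    print(f"adapter params: {adapter} ({adapter / full:.2%} of the full matrix)")

    # 13B parameters: 4-bit base vs 16-bit base, weights only
    print(weight_memory_gb(13e9, 4), "GB vs", weight_memory_gb(13e9, 16), "GB")

    W = [[1.0, 0.0], [0.0, 1.0]]
    B = [[1.0], [2.0]]   # 2 x 1
    A = [[0.5, 0.5]]     # 1 x 2
    print(merge_lora(W, B, A, alpha=2, r=1))
```

Merging once, as `merge_lora` does, trades the ability to hot-swap adapters for a single plain weight matrix at serving time.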

💰 Cost

Self-hosted QLoRA on a 7B model: roughly $2–10 in spot GPU for a prototype run. Full FT on 70B without adapters: thousands per experiment. Hosted OpenAI FT charges per training token (this guide assumes ~$8/1M as a planning figure; the actual rate depends on the base model), which is often cheaper than engineer + GPU time for small classification tasks on mini models if you already use the API in production.

⚠️ Pitfall

A LoRA trained against one base model revision and served on a newer base produces garbage outputs, because the adapter weights no longer line up with the base weights they were trained against. Pin base model revision in your registry (meta-llama/Llama-3.1-8B@sha256:…) and block deploy if base hash differs from training manifest.
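
The deploy-time check described above can be sketched as a guard that compares the training manifest against what the serving layer reports. `TrainingManifest`, `BaseMismatchError`, and the revision strings are hypothetical names for illustration; how you obtain the serving revision depends on your inference stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingManifest:
    """What the adapter was trained against (hypothetical registry record)."""
    base_model_id: str
    base_revision: str   # hash or revision recorded at training time
    adapter_path: str


class BaseMismatchError(RuntimeError):
    """Raised when the serving base differs from the training base."""


def assert_base_matches(manifest: TrainingManifest,
                        serving_model_id: str,
                        serving_revision: str) -> None:
    """Block deploy unless model ID and revision both match the manifest."""
    trained = (manifest.base_model_id, manifest.base_revision)
    serving = (serving_model_id, serving_revision)
    if trained != serving:
        raise BaseMismatchError(
            f"adapter {manifest.adapter_path} was trained on {trained}, "
            f"but the serving base is {serving}"
        )


if __name__ == "__main__":
    manifest = TrainingManifest(
        base_model_id="meta-llama/Llama-3.1-8B",
        base_revision="rev-a",   # placeholder value
        adapter_path="s3://models/support-intent-lora-v3",
    )
    assert_base_matches(manifest, "meta-llama/Llama-3.1-8B", "rev-a")
    print("base revision matches; safe to load adapter")
```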

When each approach wins

Hosted OpenAI FT — fast path for gpt-4o-mini classification/extraction; no GPU ops team
Bedrock custom model — enterprise AWS footprint, IAM-bound artifacts, private VPC inference
LoRA self-hosted — multiple tenant-specific adapters on one GPU pool; air-gapped deploy
Full FT — rare in prod; consider only small specialist models (<3B) or embedding retrievers

🔒 Security

Training data may contain PII and secrets scrubbed from prompts but present in historical tickets. Run redaction pipelines before JSONL export; restrict fine-tune dataset buckets to ML roles; never log raw training rows in CI artifacts. Bedrock and OpenAI retain data per their fine-tuning policies—read DPA before uploading customer content.

Data requirements: volume, format, quality

Fine-tuning quality is bounded by label quality, not row count alone. Fifty pristine examples can beat five thousand noisy auto-labeled pairs. Still, most instruction-tuning tasks need 50–100 examples minimum to show signal; 500–1,000 diverse rows is the practical bar for production confidence on classification and extraction.

Volume guidelines

Task type	Minimum	Production target	Notes
Binary / 3-class intent	50–100 per class	500+ total, balanced	Watch minority class recall
JSON extraction (10 fields)	100 examples	800–1,000	Cover optional/null field combos
Style / tone alignment	50 high-quality pairs	200–400 curated	Quality >> quantity; one bad row teaches bad habits
Multi-turn tool format	80 trajectories	500+	Include negative tool-call abstentions
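
For classification tasks, the per-class minimum and the production target from the table can be checked mechanically before a job is queued. A minimal sketch, assuming labels are available as a flat list; the function name and report keys are hypothetical.

```python
from collections import Counter


def volume_report(labels: list[str],
                  min_per_class: int = 50,
                  target_total: int = 500) -> dict:
    """Summarize class counts against the table's minimum and production target."""
    counts = Counter(labels)
    under = sorted(label for label, n in counts.items() if n < min_per_class)
    has_data = bool(counts)
    return {
        "total": len(labels),
        "counts": dict(counts),
        "under_minimum": under,                      # classes below min_per_class
        "meets_minimum": has_data and not under,
        "meets_production_target": has_data and not under and len(labels) >= target_total,
    }


if __name__ == "__main__":
    labels = ["billing"] * 300 + ["shipping"] * 180 + ["account"] * 35
    report = volume_report(labels)
    print(report["under_minimum"], report["meets_production_target"])
```

A class listed under `under_minimum` is the one to watch for minority-class recall, whatever the overall row count says.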

💡 Pro Tip

Quality over quantity. One hour of SME review on 100 rows beats overnight LLM-generated synthetic bulk. Auto-label with a strong model, then human-accept/reject—never train on unreviewed synthetic data for compliance-facing tasks.

JSONL format (OpenAI chat fine-tuning)

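Each line is one JSON object with a messages array. Roles: system, user, assistant. An optional weight (0 or 1) on an assistant message controls whether that message contributes to the training loss, which is useful for masking earlier turns in multi-turn examples. Preference tuning (DPO) uses a separate data format with preferred and non-preferred outputs. Validate the file locally before upload; the provider validates again when the job starts.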

# data/ft_support_intent.jsonl — one JSON object per line
EXAMPLE_CLASSIFY = {
    "messages": [
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "My card was charged twice for order ORD-8821. Need refund ASAP."},
        {"role": "assistant", "content": "{\"intent\": \"billing_duplicate_charge\", \"priority\": \"high\"}"},
    ]
}

EXAMPLE_EXTRACT = {
    "messages": [
        {"role": "system", "content": "Extract refund fields as JSON. Use null for missing."},
        {"role": "user", "content": "Customer Jane Doe, order ORD-4420, wants partial refund $24.50 for damaged item."},
        {"role": "assistant", "content": "{\"order_id\": \"ORD-4420\", \"customer_name\": \"Jane Doe\", \"amount_usd\": 24.50, \"reason\": \"damaged_item\"}"},
    ]
}

import json
from pathlib import Path

def write_jsonl(rows: list[dict], path: Path) -> None:
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

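// Jackson writes one JSON object per line for OpenAI upload
// Imports: com.fasterxml.jackson.databind.ObjectMapper, java.io.IOException,
//          java.nio.file.Files, java.nio.file.Path, java.util.List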
record ChatMessage(String role, String content) {}
record FineTuneRow(List<ChatMessage> messages) {}

void writeJsonl(List<FineTuneRow> rows, Path path) throws IOException {
    var mapper = new ObjectMapper();
    try (var writer = Files.newBufferedWriter(path)) {
        for (FineTuneRow row : rows) {
            writer.write(mapper.writeValueAsString(row));
            writer.newLine();
        }
    }
}

// Example row — intent classification
var classify = new FineTuneRow(List.of(
    new ChatMessage("system", "Classify support tickets as JSON intent + priority."),
    new ChatMessage("user", "Charged twice on ORD-8821, need refund."),
    new ChatMessage("assistant", "{\"intent\":\"billing_duplicate_charge\",\"priority\":\"high\"}")
));
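
A local pre-upload check catches most format problems before a job is created. The sketch below covers simple chat rows only (string content, no tool calls); the rule that each example ends with an assistant message is this sketch's own convention for classification and extraction data, not a statement of any provider's full validation rules.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems for one training row (empty list means OK)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            errors.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i} has empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        errors.append("last message should be the assistant target")
    return errors


def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; blank lines are skipped."""
    problems: dict[int, list[str]] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        errs = validate_row(row) if isinstance(row, dict) else ["line is not a JSON object"]
        if errs:
            problems[lineno] = errs
    return problems


if __name__ == "__main__":
    sample = json.dumps({"messages": [
        {"role": "user", "content": "Charged twice on ORD-8821"},
        {"role": "assistant", "content": "{\"intent\": \"billing_duplicate_charge\"}"},
    ]})
    print(validate_jsonl(sample) or "no problems found")
```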

Data hygiene checklist

Deduplicate — near-duplicate tickets overweight frequent phrasing
Stratify — hold out 15–20% validation split never seen in training
Tag provenance — source: human|synthetic|prod_sample per row
Version datasets — append-only git LFS or S3 with semver (ft-v1.3.0.jsonl)
PII scrub — replace names/emails with placeholders consistent at train and eval time
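
The deduplicate and stratify steps can be sketched with the standard library alone. This version only collapses duplicates that are identical after lowercasing and whitespace normalization; true near-duplicate detection (for example with embeddings or MinHash) is out of scope here. The `text` and `label` keys are assumed row fields.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe(rows: list[dict], text_key: str = "text") -> list[dict]:
    """Keep the first row for each normalized text."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        digest = hashlib.sha256(normalize(row[text_key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def stratified_split(rows: list[dict], label_key: str = "label",
                     val_fraction: float = 0.2, seed: int = 13):
    """Split per label so every class appears in validation at the same rate."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Classes with a single row stay in training; otherwise hold out at least one
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


if __name__ == "__main__":
    rows = [{"text": f"ticket {i}", "label": "billing" if i % 2 else "shipping"}
            for i in range(40)]
    train, val = stratified_split(dedupe(rows))
    print(len(train), len(val))
```

Deduplicate before splitting, so the same ticket cannot land on both sides of the held-out boundary.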

Augmentation (careful)

Safe augmentations: paraphrase user text while preserving gold label; swap entity values (order IDs, amounts) in extraction tasks; translate to secondary locale if prod is multilingual. Unsafe: LLM-generated assistant responses without review; mixing contradictory labels for the same input; augmenting policy answers that should come from RAG.

import copy
import json
import random
import re

ORDER_RE = re.compile(r"ORD-\d{4}")

def augment_order_ids(row: dict, n: int = 3) -> list[dict]:
    """Swap order IDs in user text; assistant JSON updated to match."""
    out = []
    base_user = row["messages"][1]["content"]
    base_asst = json.loads(row["messages"][2]["content"])
    for _ in range(n):
        new_id = f"ORD-{random.randint(1000, 9999)}"
        new_row = copy.deepcopy(row)
        new_row["messages"][1]["content"] = ORDER_RE.sub(new_id, base_user, count=1)
        asst = copy.deepcopy(base_asst)
        asst["order_id"] = new_id
        new_row["messages"][2]["content"] = json.dumps(asst)
        new_row["metadata"] = {"augmentation": "entity_swap"}  # local bookkeeping; drop before upload if the provider rejects unknown keys
        out.append(new_row)
    return out

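// Needs Jackson: ObjectMapper, JsonProcessingException, and databind.node.ObjectNode
List<FineTuneRow> augmentOrderIds(FineTuneRow base, int n, Random rng) throws JsonProcessingException {
    var mapper = new ObjectMapper();
    var results = new ArrayList<FineTuneRow>();
    var messages = base.messages();
    var userText = messages.get(1).content();
    for (int i = 0; i < n; i++) {
        String newId = "ORD-" + (1000 + rng.nextInt(9000));
        String newUser = userText.replaceFirst("ORD-\\d{4}", newId);
        // Parse assistant JSON, swap order_id, re-serialize (same pattern as Python)
        ObjectNode assistant = (ObjectNode) mapper.readTree(messages.get(2).content());
        assistant.put("order_id", newId);
        results.add(new FineTuneRow(List.of(
            messages.get(0),
            new ChatMessage("user", newUser),
            new ChatMessage("assistant", mapper.writeValueAsString(assistant)))));
    }
    return results;
}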

⚠️ Pitfall

Training on production logs without filtering failed or escalated conversations teaches the model to reproduce bad agent behavior. Include only rows where human SME or post-hoc eval marked the response as correct.

📦 Real World

Mature teams maintain a fine-tune backlog fed by eval failures: when golden case X fails for 3 consecutive prompt versions, SME writes 5–10 corrected pairs and opens a dataset PR. Fine-tune jobs trigger only when backlog crosses 500 reviewed rows and prompt path plateaus—same loop as CI evaluation.

⚖️ Trade-off

Synthetic data scales fast but encodes generator biases. A model fine-tuned on GPT-4-generated labels may inherit GPT-4 failure modes your base mini model would otherwise avoid. Cap synthetic ratio (e.g. ≤30%) unless human-reviewed.
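
The synthetic cap can be enforced at dataset build time using the per-row provenance tag from the hygiene checklist. In this sketch the `source` values follow that checklist, while the `reviewed` flag is a hypothetical field marking rows a human has accepted; reviewed synthetic rows are exempt from the cap.

```python
def is_unreviewed_synthetic(row: dict) -> bool:
    """Synthetic rows count against the cap unless a human accepted them."""
    return row.get("source") == "synthetic" and not row.get("reviewed", False)


def cap_synthetic(rows: list[dict], max_ratio: float = 0.30) -> list[dict]:
    """Keep all other rows, then admit unreviewed synthetic rows up to max_ratio."""
    kept = [r for r in rows if not is_unreviewed_synthetic(r)]
    admitted = 0
    for row in rows:
        if not is_unreviewed_synthetic(row):
            continue
        # Admit only if the ratio after adding this row stays within the cap
        if (admitted + 1) / (len(kept) + 1) <= max_ratio:
            kept.append(row)
            admitted += 1
    return kept


def synthetic_ratio(rows: list[dict]) -> float:
    """Share of unreviewed synthetic rows in the dataset."""
    if not rows:
        return 0.0
    return sum(is_unreviewed_synthetic(r) for r in rows) / len(rows)


if __name__ == "__main__":
    rows = [{"source": "human"}] * 70 + [{"source": "synthetic"}] * 100
    capped = cap_synthetic(rows)
    print(len(capped), round(synthetic_ratio(capped), 2))
```

Admitted synthetic rows are appended after the others, so shuffle the result before training if row order matters to your pipeline.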

OpenAI fine-tuning: upload, jobs, models, cost

OpenAI’s managed fine-tuning pipeline: validate JSONL → upload file → create job → poll until succeeded → deploy returned model ID through your gateway. Training is billed per token processed during the job. The examples in this guide use ~$8 per 1M training tokens as a planning figure; actual rates differ by base model and change over time, so verify current pricing before budgeting.

Supported models (typical)

Eligible base models change with provider releases. Common production choices:

Base model	Best for	Training cost order
gpt-4o-mini	Classification, extraction, routing	Low — default first FT experiment
gpt-4o	Complex format + reasoning style	Higher train + inference
gpt-3.5-turbo (legacy)	Existing FT continuity	Check deprecation notices

Check the OpenAI docs for the current list of fine-tunable models before automating the pipeline.

Job lifecycle

validating_files — schema and token count checks
queued — waiting for capacity
running — training in progress (minutes to hours)
succeeded — fine_tuned_model ID available
failed — inspect error object; fix data, retry
cancelled — user or timeout abort

from openai import OpenAI
import time

client = OpenAI()

# 1. Upload training file
with open("data/ft_support_intent.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")

# 2. Create job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="support-intent-v1",
    hyperparameters={"n_epochs": 3},  # start conservative; watch overfit on small sets
)

# 3. Poll until complete
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status, job.trained_tokens)

if job.status != "succeeded":
    raise RuntimeError(job.error)

fine_tuned_model = job.fine_tuned_model  # ft:gpt-4o-mini:org:support-intent-v1:abc123

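# 4. Smoke inference (reuse the training system prompt so the input matches the training distribution)
resp = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "Double charge on ORD-9912"},
    ],
)
print(resp.choices[0].message.content)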

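// Illustrative Java sketch: the client class and method names below are placeholders, not a verified
// Spring AI or OpenAI Java SDK API. Map them onto the fine-tuning client your library actually provides.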
OpenAiApi api = new OpenAiApi(System.getenv("OPENAI_API_KEY"));

Resource trainingFile = new FileSystemResource("data/ft_support_intent.jsonl");
String fileId = api.uploadFile(trainingFile, "fine-tune").getId();

FineTuningJobRequest req = FineTuningJobRequest.builder()
    .trainingFile(fileId)
    .model("gpt-4o-mini-2024-07-18")
    .suffix("support-intent-v1")
    .hyperparameters(Map.of("n_epochs", 3))
    .build();

FineTuningJob job = api.createFineTuningJob(req);
while (!List.of("succeeded", "failed", "cancelled").contains(job.getStatus())) {
    Thread.sleep(30_000);
    job = api.retrieveFineTuningJob(job.getId());
}

String fineTunedModel = job.getFineTunedModel();
// Route via gateway: model_registry.put("support-intent", fineTunedModel);

Cost estimation

Estimate training tokens ≈ sum of all message tokens × epochs. Example: 800 rows × ~400 tokens/row × 3 epochs ≈ 960K training tokens → ~$7.68 at $8/1M. Add 20% buffer for validation files and failed retries. Inference on a fine-tuned model is billed per token like any other call, and OpenAI has priced fine-tuned variants above the corresponding base model, so check the fine-tuned rate on the pricing page. FT does not reduce per-token inference cost; it only improves task accuracy per call.
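
The estimate above is easy to script so it can run against a real dataset before a job is queued. A minimal sketch: the function name is hypothetical, and the price is passed in as a parameter because it is a planning figure rather than a quoted rate.

```python
def estimate_training_cost(rows: int,
                           avg_tokens_per_row: int,
                           epochs: int,
                           usd_per_million_tokens: float,
                           buffer: float = 0.20) -> dict:
    """Training tokens = rows x tokens/row x epochs; buffer covers retries and validation."""
    tokens = rows * avg_tokens_per_row * epochs
    base_usd = tokens / 1_000_000 * usd_per_million_tokens
    return {
        "training_tokens": tokens,
        "base_usd": round(base_usd, 2),
        "with_buffer_usd": round(base_usd * (1 + buffer), 2),
    }


if __name__ == "__main__":
    # Worked example from the text: 800 rows, ~400 tokens each, 3 epochs, $8 per 1M tokens
    print(estimate_training_cost(800, 400, 3, usd_per_million_tokens=8.0))
```

With the 20% buffer the worked example comes to about $9.22, still small next to the engineering time spent preparing the data.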

💰 Cost

Training: ~$8/1M tokens (verify live pricing). A 1,000-row job often lands under $15 total—cheap compared to engineer time if eval proves lift. Hidden cost: running 5 failed experiments without eval gates burns budget and morale.

AWS Bedrock custom models

For teams standardized on Bedrock, model customization jobs write their output to S3 and expose inference through Bedrock runtime with IAM boundaries. (Custom Model Import is a separate feature for weights trained outside Bedrock.) Flow parallels OpenAI: prepare JSONL/JSON → upload to S3 → create customization job → deploy custom model ARN in gateway routing table alongside foundation model IDs.

import boto3

bedrock = boto3.client("bedrock")

# Exact API varies by base model family — consult Bedrock model customization docs
job = bedrock.create_model_customization_job(
    jobName="support-intent-v1",
    customModelName="support-intent-classifier",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    trainingDataConfig={
        "s3Uri": "s3://ml-datasets/ft/support-intent-v1.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://ml-models/custom/support-intent-v1/"
    },
    hyperParameters={"epochCount": "3", "batchSize": "8"},
)

# Poll bedrock.get_model_customization_job(jobIdentifier=job["jobArn"]) until status is "Completed"
# Register the returned outputModelArn in llm-gateway bedrock routing

🔬 Under the Hood

OpenAI counts every token in every message across all epochs toward the training bill. Shorter system prompts in training data reduce cost but must match production system prompt semantics; otherwise you train a distribution shift into the model.

⚠️ Pitfall

Setting n_epochs too high on <200 rows overfits memorization: perfect training loss, worse held-out eval. Start with 2–3 epochs; use validation loss curves if provider exposes them; early-stop on eval regression.
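
Where per-epoch validation loss is available (self-hosted runs, or hosted jobs that report it), choosing the stopping point can be sketched as below. The function name and the patience rule are illustrative assumptions; hosted APIs may not let you stop a job mid-run, in which case the same logic helps pick `n_epochs` for the next job.

```python
def best_epoch(val_losses: list[float], patience: int = 1) -> int:
    """Return the 1-based epoch with the lowest validation loss.

    Scanning stops once `patience` consecutive epochs fail to improve,
    mimicking early stopping on a held-out set.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_idx, best_loss, stale = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, stale = i, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx + 1


if __name__ == "__main__":
    # Training loss may keep falling while validation loss turns upward: overfitting
    print(best_epoch([1.10, 0.82, 0.79, 0.85, 0.93]))
```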

💡 Pro Tip

Store fine_tuning.job.id, training file hash, dataset semver, and resulting model ID in your model registry YAML. When auditors ask “what data trained model X?” you answer from git, not Slack history.
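
A registry entry with a dataset hash can be produced at job-creation time. This sketch returns a plain dict that could be dumped to YAML or JSON; the field names mirror the tip above but are otherwise hypothetical, and the job and model IDs in the usage example are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash the training file in chunks so large JSONL files do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_entry(job_id: str, model_id: str,
                   dataset_version: str, training_file: Path) -> dict:
    """Bundle the lineage facts an auditor would ask for."""
    return {
        "job_id": job_id,
        "model_id": model_id,
        "dataset": dataset_version,
        "dataset_sha256": sha256_file(training_file),
    }


if __name__ == "__main__":
    sample = Path("ft_sample.jsonl")
    sample.write_text('{"messages": []}\n', encoding="utf-8")
    print(registry_entry("ftjob-placeholder", "ft:placeholder", "ft-v1.3.0", sample))
    sample.unlink()
```

Hashing the exact uploaded bytes, not the in-memory rows, is what lets you prove later which file a model was trained on.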

Eval before & after: never ship without comparison

A fine-tuned model that beats your intuition but loses 4 points on refund intent recall is a production incident waiting to happen. Run the same eval suite against base and fine-tuned endpoints; promote only when aggregate and per-bucket metrics improve or tie within noise with acceptable cost/latency trade-offs.

Reuse Track 5 infrastructure: golden JSONL, scorers, CI thresholds from evaluation explained, CI evaluation, and LLM-as-judge patterns. Fine-tune eval is not a separate discipline—it is another model variant in the same regression matrix as prompt v2 or RAG index v3.

Comparison protocol

Freeze golden set — held-out rows never in training JSONL
Run base model — record scores, latencies, token usage per case ID
Run fine-tuned model — identical prompts, temperature, max_tokens
Diff per bucket — intent, locale, risk tier; block if any bucket drops >5%
Human review sample — 20 regressions where FT scored lower than base
Sign-off artifact — JSON report committed or attached to deploy PR

Metric	Accept FT if	Block if
Exact match (classification)	+3% absolute vs base	Any safety bucket regression
JSON schema valid rate	+5% or base <95% and FT ≥98%	FT < base
LLM judge faithfulness	+0.05 mean (RAG tasks)	−0.03 on policy bucket
P95 latency	Within 10% of base	>25% slower without accuracy gain
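
The "diff per bucket" step in the protocol can be sketched as follows. Assumptions for illustration: scores are keyed by case ID, each case carries a list of tags, the 5% threshold is read as an absolute drop in mean score, and buckets named in `safety_tags` tolerate no drop at all.

```python
from collections import defaultdict


def bucket_means(scores: dict[str, float], tags: dict[str, list[str]]) -> dict[str, float]:
    """Mean score per tag; a case contributes to every tag it carries."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for case_id, score in scores.items():
        for tag in tags.get(case_id, []):
            sums[tag] += score
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


def blocked_buckets(base_scores: dict[str, float],
                    ft_scores: dict[str, float],
                    tags: dict[str, list[str]],
                    max_drop: float = 0.05,
                    safety_tags: frozenset = frozenset({"safety"})) -> dict[str, float]:
    """Return buckets where the fine-tuned model dropped more than allowed."""
    base = bucket_means(base_scores, tags)
    ft = bucket_means(ft_scores, tags)
    blocked = {}
    for tag, base_mean in base.items():
        drop = base_mean - ft.get(tag, 0.0)
        limit = 0.0 if tag in safety_tags else max_drop
        if drop > limit:
            blocked[tag] = round(drop, 4)
    return blocked


if __name__ == "__main__":
    tags = {"c1": ["refund"], "c2": ["refund"], "c3": ["billing"], "c4": ["billing"]}
    base = {"c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 0.0}
    ft = {"c1": 1.0, "c2": 0.0, "c3": 1.0, "c4": 1.0}
    print(blocked_buckets(base, ft, tags))
```

In the usage example the overall mean is unchanged, yet the refund bucket is blocked, which is exactly the case an aggregate-only gate would miss.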

"""scripts/compare_ft_models.py — run before any FT deploy."""
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class EvalRow:
    id: str
    input: str
    expected: dict
    tags: list[str]

@dataclass
class CaseResult:
    row_id: str
    model: str
    score: float
    output: str
    latency_ms: float

def load_golden(path: Path) -> list[EvalRow]:
    return [EvalRow(**json.loads(line)) for line in path.read_text().splitlines() if line.strip()]

def score_exact_json(output: str, expected: dict) -> float:
    try:
        parsed = json.loads(output)
        return 1.0 if parsed == expected else 0.0
    except json.JSONDecodeError:
        return 0.0

def run_suite(rows: list[EvalRow], model_id: str, complete_fn) -> list[CaseResult]:
    results = []
    for row in rows:
        out, latency = complete_fn(model_id, row.input)
        results.append(CaseResult(row.id, model_id, score_exact_json(out, row.expected), out, latency))
    return results

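def compare(base: list[CaseResult], ft: list[CaseResult], min_lift: float = 0.03) -> dict:
    by_id_base = {r.row_id: r for r in base}
    by_id_ft = {r.row_id: r for r in ft}
    # Both runs must cover the same golden cases, otherwise the lift is not comparable
    if not by_id_base or by_id_base.keys() != by_id_ft.keys():
        raise ValueError("base and fine-tuned results must cover the same non-empty set of case IDs")
    lifts = [by_id_ft[i].score - by_id_base[i].score for i in by_id_base]
    mean_lift = sum(lifts) / len(lifts)
    regressions = [i for i in by_id_base if by_id_ft[i].score < by_id_base[i].score]
    return {
        "mean_lift": mean_lift,
        # Pass needs enough average lift and fewer than 15% of cases regressing
        "pass": mean_lift >= min_lift and len(regressions) / len(lifts) < 0.15,
        "regression_ids": regressions[:20],  # capped for the human review sample
    }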

# Usage:
# base_results = run_suite(golden, "gpt-4o-mini", gateway_complete)
# ft_results = run_suite(golden, "ft:gpt-4o-mini:org:...", gateway_complete)
# report = compare(base_results, ft_results)
# assert report["pass"], report

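public record EvalRow(String id, String input, Map<String, Object> expected, List<String> tags) {}
public record CaseResult(String rowId, String model, double score, String output, long latencyMs) {}
// Report returned by compare(); regressionIds is capped at 20 for human review
public record CompareReport(double meanLift, boolean pass, List<String> regressionIds) {}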

public CompareReport compare(List<CaseResult> base, List<CaseResult> ft, double minLift) {
    Map<String, CaseResult> baseMap = base.stream()
        .collect(Collectors.toMap(CaseResult::rowId, r -> r));
    List<Double> lifts = new ArrayList<>();
    List<String> regressions = new ArrayList<>();
    for (CaseResult ftRow : ft) {
        CaseResult b = baseMap.get(ftRow.rowId());
        double lift = ftRow.score() - b.score();
        lifts.add(lift);
        if (lift < 0) regressions.add(ftRow.rowId());
    }
    double meanLift = lifts.stream().mapToDouble(d -> d).average().orElse(0);
    boolean pass = meanLift >= minLift && (double) regressions.size() / lifts.size() < 0.15;
    return new CompareReport(meanLift, pass, regressions.stream().limit(20).toList());
}

⚠️ Pitfall

Never ship without comparison. A common failure: a team promotes FT after a manual vibe-check on 10 examples, when the base model with an improved prompt would have passed full CI. Wire FT promotion to the same GitHub Actions gate as prompt changes; see continuous evaluation in CI/CD.

🔒 Security

Include adversarial and jailbreak cases from safety guardrails in FT comparison. Fine-tuning can erode refusal behavior if training data lacks negative examples. Zero-tolerance: any safety regression blocks deploy.

🎯 Interview Tip

“How do you validate a fine-tuned model?” — Held-out golden set, base vs FT A/B, bucketed metrics, safety suite unchanged, human review on regressions, model ID pinned in registry with dataset hash. Mention CI gate, not offline accuracy alone.

📦 Real World

Canary traffic: serve 5% of prod requests from the FT endpoint and log scores with online eval sampling from observability & monitoring. Promote to 100% only when the offline lift is confirmed and the canary week shows no ticket escalation spike.
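
Routing a fixed percentage of requests to the FT endpoint is usually done with a stable hash, so the same request key always gets the same decision. A minimal sketch; the function name, the salt value, and the choice of request key (user ID, ticket ID, and so on) are assumptions to adapt to your gateway.

```python
import hashlib


def in_rollout(request_key: str, percent: float,
               salt: str = "ft_support_intent_v1") -> bool:
    """Deterministically place a key into the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def pick_model(request_key: str, base_model: str, ft_model: str, percent: float) -> str:
    """Route to the fine-tuned model only for keys inside the rollout slice."""
    return ft_model if in_rollout(request_key, percent) else base_model


if __name__ == "__main__":
    hits = sum(in_rollout(f"ticket-{i}", 5) for i in range(10_000))
    print(f"{hits / 100:.1f}% of sample keys routed to the FT endpoint")
```

Changing the salt reshuffles which keys fall in the slice, which is useful when a second experiment should not reuse the first experiment's cohort.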

CI integration sketch

PR opens (dataset or FT job completes)
  → evals/compare_ft.yaml workflow
  → download golden held-out v2.1
  → run base + candidate ft model
  → upload compare.json artifact
  → PR comment: mean_lift + regression table
  → merge blocked if pass=false

Production: adaptation checklist

Fine-tuned models are production assets: register IDs in the gateway, gate rollout with feature flags, log dataset lineage, and keep a rollback path to base model within one config change. This checklist closes Guide 3 before multimodal ingest pipelines.

Pre-deploy checklist

Decision doc — why FT vs prompt/RAG; link to path comparison eval
Dataset semver — training JSONL hash in model registry
Held-out eval — base vs FT report attached; all buckets green
Safety suite — no regression on adversarial golden
Gateway entry — model_alias: support-intent → ft:… in routing config
Feature flag — ft_support_intent_v1 default off; staged % rollout
Cost projection — inference tokens unchanged; training cost amortized in FinOps tag
Rollback runbook — flip flag to base model; no redeploy required
Retrain trigger — backlog threshold + eval plateau documented
On-call — ML owner paged on FT latency/error rate anomalies

Control	Implementation	Alert when
Model ID drift	Registry validates ft: prefix at deploy	Unknown model in request log
FT error rate	Compare 5xx vs base model route	FT >2× base for 15 min
Quality drift	Online eval sample 1%	Judge score −0.05 vs weekly baseline
Stale weights	Policy freshness in RAG; FT age in registry	FT >90 days without re-eval
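
Two of the alert rules in the table reduce to simple comparisons that can run in a scheduled job. This sketch assumes the caller supplies error and request counts already aggregated over the 15-minute window; the function and alert names are hypothetical.

```python
from datetime import date


def ft_alerts(ft_errors: int, ft_requests: int,
              base_errors: int, base_requests: int,
              last_eval: date, today: date,
              error_ratio: float = 2.0, max_age_days: int = 90) -> list[str]:
    """Return the names of alert rules that currently fire."""
    alerts = []
    ft_rate = ft_errors / ft_requests if ft_requests else 0.0
    base_rate = base_errors / base_requests if base_requests else 0.0
    # FT error rate more than error_ratio times the base route's rate
    if ft_errors > 0 and ft_rate > error_ratio * base_rate:
        alerts.append("ft_error_rate")
    # Fine-tuned weights not re-evaluated within max_age_days
    if (today - last_eval).days > max_age_days:
        alerts.append("stale_weights")
    return alerts


if __name__ == "__main__":
    print(ft_alerts(ft_errors=12, ft_requests=1_000,
                    base_errors=3, base_requests=1_000,
                    last_eval=date(2025, 1, 10), today=date(2025, 5, 1)))
```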

# config/models.py: registry shown as a Python dict (the same structure can live in YAML)
MODELS = {
    "support-intent": {
        "default": "gpt-4o-mini-2024-07-18",
        "variants": {
            "ft_v1": {
                "model_id": "ft:gpt-4o-mini:org:support-intent-v1:abc123",
                "dataset": "ft-support-intent-v1.2.0",
                "dataset_sha256": "a1b2c3…",
                "eval_report": "reports/ft_v1_compare.json",
                "flag": "ft_support_intent_v1",
            }
        },
    }
}

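def resolve_model(feature: str, flags: dict) -> str:
    cfg = MODELS[feature]
    # Use the flag name recorded on each variant rather than hard-coding one feature's flag
    for variant in cfg["variants"].values():
        if flags.get(variant["flag"]):
            return variant["model_id"]
    return cfg["default"]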

@ConfigurationProperties("llm.models")
public class ModelRegistry {
    Map<String, FeatureModels> features;

    public String resolve(String feature, FeatureFlags flags) {
        var f = features.get(feature);
        // Use the flag recorded on each variant instead of hard-coding one flag name
        for (var variant : f.variants().values()) {
            if (flags.isEnabled(variant.flag())) {
                return variant.modelId();
            }
        }
        return f.defaultModel();
    }
}

📦 Production checklist

Guide 3 complete when: FT model is behind a flag, registry documents dataset lineage, CI blocks deploy without compare report, rollback tested in staging, and on-call knows which flag to disable. Continue to multimodal ingest for document pipelines.

⚖️ Trade-off

Maintaining both base and FT endpoints doubles model surface area for security reviews and cost attribution. Sunset base-only path only when FT survives two re-eval cycles—not after first successful week.

💰 Cost

Fine-tuning does not reduce inference token prices. Budget FT when it reduces downstream cost: fewer repair loops, smaller prompts, or replacing frontier with fine-tuned mini on the same quality bar.

🔬 Under the Hood

OpenAI fine-tuned models share rate limits with your org tier on that model family. Load-test FT endpoint before 100% flag— cold start is rare but concurrency limits bite during flash traffic.

⚠️ Pitfall

Deploying FT for knowledge that should be RAG: model answers confidently after policy change until someone notices support tickets. Keep RAG for facts; FT for format and routing. Re-run decision matrix on every retrain request.