Fine-tuning & adaptation

Fine-tuning bakes behavior into weights; RAG and prompts change inputs at query time. Teams often fine-tune because it feels like “real ML,” then discover that refund answers are still wrong because the facts live in PDFs that change weekly. Never fine-tune for knowledge alone: use RAG for facts, prompts for tone and format, and fine-tuning only when you have stable patterns, labeled data, and eval proof that the cheaper adaptation paths cannot reach your bar.

After reading, you should be able to: choose among prompt engineering, RAG, routing, and fine-tuning with a decision matrix; compare full FT, LoRA, QLoRA, and instruction tuning; prepare JSONL training data with quality gates; run OpenAI fine-tuning jobs and read lifecycle states; estimate training cost at ~$8/1M tokens; run before/after eval suites against base and fine-tuned models; wire Bedrock custom model imports; and complete the Track 6 adaptation checklist before production rollout.

When to fine-tune: prompt vs RAG vs fine-tune

Adaptation changes inputs and orchestration at query time. Fine-tuning changes the model’s default behavior by updating weights. The wrong choice wastes GPU budget, creates stale knowledge, and blocks fast policy updates. Start with the cheapest path that can pass your eval suite—not the path that sounds most impressive.

Most production LLM features need three capabilities: correct facts, consistent format, and appropriate tone. These map to different adaptation layers. Facts that change weekly belong in a retrieval index, not in memorized weights. Stable JSON schemas and classification boundaries are fine-tuning sweet spots when prompt engineering plateaus on your golden set.

Four adaptation paths

Path | What moves | Best when | Weak when
Prompt engineering | System instructions, few-shot examples, output schema | Tone, refusals, structured output, short policy summaries | Long factual corpora stuffed into context; brittle format at scale
RAG | Retrieved chunks injected at query time | Policies, docs, catalogs that change often; cite-able answers | Needle-in-haystack without good index; latency-sensitive micro-classifiers
Routing + tools | Model tier, agent loop, external APIs | Structured workflows (refunds, lookups, calculations) | Pure free-text FAQ where every answer is generative
Fine-tuning | Model weights (full or adapter) | Stable style, intent classification, extraction schemas, domain phrasing | Facts change weekly; insufficient labeled data; prompt+RAG already pass eval

Decision matrix

Score each incoming use case against these signals. If policy or catalog freshness is critical, RAG wins regardless of team preference. If the task is “always return this JSON shape for support tickets,” try structured output + prompt first; fine-tune only when eval shows persistent format drift across hundreds of real examples.

Signal | Prefer | Avoid fine-tune because
Answer must cite source documents | RAG + citation prompt | FT cannot guarantee retrieval; weights hallucinate confidently
Policy updates > once/month | RAG | Retrain lag; stale weights until next job completes
Fixed output schema (100+ fields) | Structured output → then FT if needed | FT helps only after native schema mode exhausted
Intent routing (5–20 labels) | Mini model + prompt → FT on mini | Frontier FT is expensive for simple classification
Brand voice stable 12+ months | Prompt → FT on style rows | Prompt alone may work; FT is optimization not unblocker
Multilingual phrasing in one domain | FT or adapter on open model | Prompts balloon per locale
⚠️ Pitfall

Never fine-tune for knowledge alone. If the model must know this quarter’s refund policy, product SKUs, or internal org chart, that information belongs in RAG or a database tool—not in weights. Fine-tuned “facts” go stale silently; users get confident wrong answers with no audit trail. RAG gives you chunk IDs to log; fine-tuning gives you regret.

💡 Pro Tip

Run the same golden eval through prompt-only, RAG, and (if justified) a fine-tuned endpoint before committing to FT. If prompt + RAG already meet your threshold, fine-tuning is a cost/latency optimization—not a project unblocker. Document the comparison in the PR that approves GPU spend.

Decision flowchart

flowchart TD
  A[New LLM feature request] --> B{Does answer require citing live documents?}
  B -->|Yes| C[Use RAG]
  B -->|No| D{Is behavior stable 12+ months?}
  D -->|No| E[Prompt + tools + routing]
  D -->|Yes| F{Prompt + structured output passes eval?}
  F -->|Yes| G[Ship prompt path, no FT]
  F -->|No| H{Have 500+ quality labeled examples?}
  H -->|No| I[Collect data, improve prompt/RAG first]
  H -->|Yes| J{Before/after eval shows FT lift?}
  J -->|No| K[Do not fine-tune]
  J -->|Yes| L[Fine-tune with CI eval gate]
  C --> M[Track chunk freshness SLA]
  L --> N[Register model ID in gateway + flags]
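The flowchart can also be written as a small function, which makes the decision reviewable in a PR. This is a minimal sketch: the UseCase field names and the returned labels are illustrative assumptions, and the 500-example default simply mirrors the flowchart's threshold.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Signals from the flowchart; field names are illustrative."""
    cites_live_documents: bool
    stable_12_months: bool
    prompt_passes_eval: bool
    labeled_examples: int
    ft_shows_eval_lift: bool


def choose_path(uc: UseCase, min_examples: int = 500) -> str:
    """Walk the flowchart top to bottom and return the recommended path."""
    if uc.cites_live_documents:
        return "rag"
    if not uc.stable_12_months:
        return "prompt_tools_routing"
    if uc.prompt_passes_eval:
        return "prompt_only"
    if uc.labeled_examples < min_examples:
        return "collect_data"
    if not uc.ft_shows_eval_lift:
        return "do_not_fine_tune"
    return "fine_tune_with_ci_gate"


# A stable classifier whose prompt path plateaued, with enough data and proven lift
print(choose_path(UseCase(False, True, False, 800, True)))
```

The checks run in the same order as the diagram, so a request only reaches the fine-tuning branch after every cheaper path has been ruled out.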
📦 Real World

Support teams often request fine-tuning after 50 “bad tone” examples. In practice, a revised system prompt + 8 curated few-shot pairs fixes 80% of tone issues in a day. Reserve fine-tuning for classification F1 stuck below target after prompt iteration, or extraction JSON that fails schema validation on >5% of production traffic despite response_format.

🎯 Interview Tip

When asked “RAG or fine-tune?” answer with the freshness axis: RAG for mutable facts, fine-tune for stable behavior patterns. Mention eval gates—never ship FT without base vs fine-tuned comparison on the same golden set. Interviewers listen for “we fine-tuned the retriever embedding model” vs “we fine-tuned GPT-4 for policy”—different problems, different data.

⚖️ Trade-off

Prompts and RAG update in minutes; fine-tuning updates in hours to days plus re-eval cycles. The operational cost of FT is not the training bill—it is the retrain cadence and model registry every time behavior must change. Prefer adaptation paths your on-call can roll back with a flag flip.

Prerequisite context: RAG explained, system prompts, and LLM gateway for routing fine-tuned model IDs alongside base models.

Approaches: full FT, LoRA, QLoRA, instruction tuning

“Fine-tuning” spans updating every weight in a 70B model to training a 16MB LoRA adapter on a laptop. Hosted APIs (OpenAI, Bedrock) hide the technique; self-hosted open models expose the full menu. Pick based on model access, GPU budget, and whether you need mergeable artifacts for air-gapped deploy.

Comparison table

Approach | What updates | VRAM / cost | Typical use | Production notes
Full fine-tuning | All model parameters | Very high (multi-GPU, 70B+ impractical without cluster) | Domain shift on open models; research; small models (<7B) | Produces full checkpoint; hard to A/B vs base without duplicate serving
LoRA | Low-rank adapter matrices on attention/MLP layers | Moderate (often 1–4× less than full FT) | Instruction following, style, classification on 7B–70B | Swap adapters at runtime; multiple tasks on one base
QLoRA | LoRA on 4-bit quantized base weights | Low (consumer GPU for 7B–13B) | Prototyping, internal tools, cost-sensitive open-model FT | Quantization-aware; validate perplexity vs bf16 base before prod
Instruction tuning | SFT on (instruction, response) pairs (full or LoRA) | Depends on base technique | Chat alignment, tool-call format, JSON extraction | OpenAI hosted FT is instruction-style on chat messages JSONL

Hosted vs self-hosted

OpenAI fine-tuning is managed instruction tuning: you upload JSONL, they train, you receive ft:gpt-4o-mini:org:... model IDs. You do not choose LoRA rank or other adapter internals; the public API exposes only a few hyperparameters, such as n_epochs, batch size, and a learning-rate multiplier. AWS Bedrock offers model customization jobs (fine-tuning and continued pre-training on supported base models) with S3-stored output artifacts and provisioned throughput options, plus a separate Custom Model Import path for weights trained elsewhere. Self-hosted (Hugging Face PEFT, Axolotl, LLaMA-Factory) gives full control over LoRA/QLoRA hyperparameters and merge-to-single-checkpoint for vLLM or TGI serving.

LoRA config sketch (self-hosted)
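from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base checkpoint is an example; pin the exact revision you train against
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                    # rank: higher = more capacity, more VRAM
    lora_alpha=32,           # scaling factor applied to the adapter update
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Train only ~0.1–1% of parameters; save the adapter dir, not the full base weights

// Self-hosted LoRA typically runs via Python training jobs.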
// Java serving layer loads base + adapter metadata from model registry:

public record AdapterManifest(
    String baseModelId,      // e.g. meta-llama/Llama-3.1-8B
    String adapterPath,      // s3://models/support-intent-lora-v3
    int rank,
    String taskType
) {}

// Gateway resolves: tenant → adapter manifest → inference endpoint
🔬 Under the Hood

LoRA injects trainable low-rank matrices ΔW = BA alongside frozen weights W. At inference, the effective weight is W + BA (or merged once for latency). QLoRA keeps W in 4-bit NF4 while adapters stay in fp16/bf16, which is why a 24GB GPU can fine-tune 13B models that full FT would exclude.
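The arithmetic behind both claims is short enough to check by hand. The sketch below assumes a single 4096 × 4096 projection matrix as the example layer (an illustrative size, not taken from a specific model card) and counts weight storage only; activations, optimizer state, and quantization constants are deliberately ignored.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


full = 4096 * 4096                                  # one frozen projection matrix W
lora = lora_trainable_params(4096, 4096, rank=16)   # its adapter pair B and A

print(full, lora, lora / full)        # 16777216 131072 0.0078125
print(weight_memory_gb(13e9, 16))     # 26.0 (13B weights in bf16)
print(weight_memory_gb(13e9, 4))      # 6.5  (13B weights in 4-bit)
```

At rank 16 the adapter adds under 1% of the layer's parameters, and a 4-bit 13B base needs about 6.5 GB for weights versus 26 GB in bf16, which leaves room on a 24GB card for adapters and activations.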

💰 Cost

Self-hosted QLoRA on a 7B model: roughly $2–10 in spot GPU for a prototype run. Full FT on 70B without adapters: thousands per experiment. Hosted OpenAI FT charges per training token (~$8/1M)—often cheaper than engineer + GPU time for small classification tasks on mini models if you already use the API in production.

⚠️ Pitfall

Training a LoRA on stale base model weights, then serving a newer base at inference—adapter mismatch causes garbage outputs. Pin base model revision in your registry (meta-llama/Llama-3.1-8B@sha256:…) and block deploy if base hash differs from training manifest.
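A deploy-time guard for this pitfall might look like the following. It is a sketch under assumptions: TrainingManifest, BaseMismatchError, and the revision strings are hypothetical names, and how you obtain the serving revision depends on your registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingManifest:
    """What the adapter was trained against (hypothetical registry record)."""
    base_model_id: str
    base_revision: str   # hash pinned at training time
    adapter_path: str


class BaseMismatchError(RuntimeError):
    """Raised when the serving base differs from the training base."""


def assert_base_matches(manifest: TrainingManifest,
                        serving_model_id: str,
                        serving_revision: str) -> None:
    """Block the deploy unless model ID and revision both match the manifest."""
    trained = (manifest.base_model_id, manifest.base_revision)
    serving = (serving_model_id, serving_revision)
    if trained != serving:
        raise BaseMismatchError(
            f"adapter {manifest.adapter_path} was trained on {trained}, "
            f"but the endpoint serves {serving}"
        )


manifest = TrainingManifest("meta-llama/Llama-3.1-8B", "rev-a",
                            "s3://models/support-intent-lora-v3")
assert_base_matches(manifest, "meta-llama/Llama-3.1-8B", "rev-a")  # passes silently
```

Raising instead of logging a warning is intentional here: a mismatched adapter produces plausible-looking garbage, so failing the deploy is the safer default.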

When each approach wins

  • Hosted OpenAI FT — fast path for gpt-4o-mini classification/extraction; no GPU ops team
  • Bedrock custom model — enterprise AWS footprint, IAM-bound artifacts, private VPC inference
  • LoRA self-hosted — multiple tenant-specific adapters on one GPU pool; air-gapped deploy
  • Full FT — rare in prod; consider only small specialist models (<3B) or embedding retrievers
🔒 Security

Training data may contain PII and secrets scrubbed from prompts but present in historical tickets. Run redaction pipelines before JSONL export; restrict fine-tune dataset buckets to ML roles; never log raw training rows in CI artifacts. Bedrock and OpenAI retain data per their fine-tuning policies—read DPA before uploading customer content.
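As one piece of such a redaction pipeline, the sketch below replaces email addresses with stable placeholders so the same address maps to the same token across training and eval data. The regex is a simple approximation and the <EMAIL_n> placeholder format is an assumption; names, phone numbers, and secrets need their own detectors.

```python
import re

# Approximate email pattern; good enough for scrubbing, not for validation
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text: str, mapping: dict[str, str]) -> str:
    """Replace each distinct email with a stable placeholder such as <EMAIL_1>.

    `mapping` is shared across calls so placeholders stay consistent
    over a whole dataset.
    """
    def repl(match: re.Match) -> str:
        email = match.group(0).lower()
        if email not in mapping:
            mapping[email] = f"<EMAIL_{len(mapping) + 1}>"
        return mapping[email]

    return EMAIL_RE.sub(repl, text)


placeholders: dict[str, str] = {}
print(scrub_emails("Contact jane@example.com or JANE@example.com", placeholders))
# Contact <EMAIL_1> or <EMAIL_1>
```

Keep the mapping out of CI artifacts as well, since it is effectively a lookup table back to the raw addresses.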

Data requirements: volume, format, quality

Fine-tuning quality is bounded by label quality, not row count alone. Fifty pristine examples can beat five thousand noisy auto-labeled pairs. Still, most instruction-tuning tasks need 50–100 examples minimum to show signal; 500–1,000 diverse rows is the practical bar for production confidence on classification and extraction.

Volume guidelines

Task type | Minimum | Production target | Notes
Binary / 3-class intent | 50–100 per class | 500+ total, balanced | Watch minority class recall
JSON extraction (10 fields) | 100 examples | 800–1,000 | Cover optional/null field combos
Style / tone alignment | 50 high-quality pairs | 200–400 curated | Quality >> quantity; one bad row teaches bad habits
Multi-turn tool format | 80 trajectories | 500+ | Include negative tool-call abstentions
💡 Pro Tip

Quality over quantity. One hour of SME review on 100 rows beats overnight LLM-generated synthetic bulk. Auto-label with a strong model, then human-accept/reject—never train on unreviewed synthetic data for compliance-facing tasks.

JSONL format (OpenAI chat fine-tuning)

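Each line is one JSON object with a messages array. Roles: system, user, assistant. An optional weight (0 or 1) on an assistant message controls whether that turn contributes to the training loss; preference-style (DPO) tuning uses a separate data format where the pipeline supports it. Validate the file before upload, for example with the provider's validation tooling or a local script.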

JSONL training examples
# Rows for data/ft_support_intent.jsonl; each dict becomes one JSON line
EXAMPLE_CLASSIFY = {
    "messages": [
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "My card was charged twice for order ORD-8821. Need refund ASAP."},
        {"role": "assistant", "content": "{\"intent\": \"billing_duplicate_charge\", \"priority\": \"high\"}"},
    ]
}

EXAMPLE_EXTRACT = {
    "messages": [
        {"role": "system", "content": "Extract refund fields as JSON. Use null for missing."},
        {"role": "user", "content": "Customer Jane Doe, order ORD-4420, wants partial refund $24.50 for damaged item."},
        {"role": "assistant", "content": "{\"order_id\": \"ORD-4420\", \"customer_name\": \"Jane Doe\", \"amount_usd\": 24.50, \"reason\": \"damaged_item\"}"},
    ]
}

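import json
from pathlib import Path

def write_jsonl(rows: list[dict], path: Path) -> None:
    """Write one JSON object per line (JSONL), keeping non-ASCII text readable."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Usage: write_jsonl([EXAMPLE_CLASSIFY], Path("data/ft_support_intent.jsonl"))

// Jackson writes one JSON object per line for OpenAI upload
// Imports: com.fasterxml.jackson.databind.ObjectMapper, java.io.IOException,
//          java.nio.file.Files, java.nio.file.Path, java.util.List
record ChatMessage(String role, String content) {}
record FineTuneRow(List<ChatMessage> messages) {}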

void writeJsonl(List<FineTuneRow> rows, Path path) throws IOException {
    var mapper = new ObjectMapper();
    try (var writer = Files.newBufferedWriter(path)) {
        for (FineTuneRow row : rows) {
            writer.write(mapper.writeValueAsString(row));
            writer.newLine();
        }
    }
}

// Example row — intent classification
var classify = new FineTuneRow(List.of(
    new ChatMessage("system", "Classify support tickets as JSON intent + priority."),
    new ChatMessage("user", "Charged twice on ORD-8821, need refund."),
    new ChatMessage("assistant", "{\"intent\":\"billing_duplicate_charge\",\"priority\":\"high\"}")
));
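The format rules above (a messages array, the three roles, an assistant target at the end) can be checked locally before any upload. This is a minimal sketch of such a check; the rule set reflects this guide's conventions and is stricter than a provider necessarily requires, for example it rejects assistant turns without text content.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(line: str) -> list[str]:
    """Return a list of problems for one JSONL line; an empty list means OK."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant target")
    return problems


sample = json.dumps({"messages": [
    {"role": "user", "content": "Charged twice on ORD-8821"},
    {"role": "assistant", "content": "{\"intent\": \"billing_duplicate_charge\"}"},
]})
print(validate_row(sample))  # []
```

Running this over every line and failing the dataset PR on any non-empty result catches most formatting mistakes before they cost a training run.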

Data hygiene checklist

  • Deduplicate — near-duplicate tickets overweight frequent phrasing
  • Stratify — hold out 15–20% validation split never seen in training
  • Tag provenance — source: human|synthetic|prod_sample per row
  • Version datasets — append-only git LFS or S3 with semver (ft-v1.3.0.jsonl)
  • PII scrub — replace names/emails with placeholders consistent at train and eval time
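Two items from the checklist, deduplication and a stratified hold-out, can be sketched as follows. The flat row shape with "text" and "label" keys is an assumption for illustration (not the chat JSONL format), and the dedupe step only catches rows that differ in case or whitespace; true near-duplicate detection would need fuzzy matching.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe(rows: list[dict], text_key: str = "text") -> list[dict]:
    """Keep the first row for each normalized text."""
    seen, unique = set(), []
    for row in rows:
        digest = hashlib.sha256(normalize(row[text_key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def stratified_split(rows: list[dict], label_key: str = "label",
                     val_fraction: float = 0.2, seed: int = 13):
    """Hold out val_fraction of each label so minority classes reach validation."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


rows = dedupe([{"text": f"ticket {i}", "label": "a" if i < 10 else "b"} for i in range(15)])
train, val = stratified_split(rows)
print(len(train), len(val))  # 12 3
```

Deduplicate before splitting; otherwise a duplicate pair can land on both sides and leak training rows into the validation set.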

Augmentation (careful)

Safe augmentations: paraphrase user text while preserving gold label; swap entity values (order IDs, amounts) in extraction tasks; translate to secondary locale if prod is multilingual. Unsafe: LLM-generated assistant responses without review; mixing contradictory labels for the same input; augmenting policy answers that should come from RAG.

Entity-swap augmentation
import copy
import json
import random
import re

ORDER_RE = re.compile(r"ORD-\d{4}")

def augment_order_ids(row: dict, n: int = 3) -> list[dict]:
    """Swap order IDs in user text; assistant JSON updated to match."""
    out = []
    base_user = row["messages"][1]["content"]
    base_asst = json.loads(row["messages"][2]["content"])
    for _ in range(n):
        new_id = f"ORD-{random.randint(1000, 9999)}"
        new_row = copy.deepcopy(row)
        new_row["messages"][1]["content"] = ORDER_RE.sub(new_id, base_user, count=1)
        asst = copy.deepcopy(base_asst)
        asst["order_id"] = new_id
        new_row["messages"][2]["content"] = json.dumps(asst)
        new_row["metadata"] = {"augmentation": "entity_swap"}
        out.append(new_row)
    return out
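
// Java equivalent (needs Jackson: ObjectMapper, ObjectNode, JsonProcessingException)
List<FineTuneRow> augmentOrderIds(FineTuneRow base, int n, Random rng) throws JsonProcessingException {
    var mapper = new ObjectMapper();
    var results = new ArrayList<FineTuneRow>();
    var msgs = base.messages();
    var userText = msgs.get(1).content();
    for (int i = 0; i < n; i++) {
        String newId = "ORD-" + (1000 + rng.nextInt(9000));
        String newUser = userText.replaceFirst("ORD-\\d{4}", newId);
        // Parse assistant JSON, swap order_id, re-serialize (same pattern as Python)
        ObjectNode asst = (ObjectNode) mapper.readTree(msgs.get(2).content());
        asst.put("order_id", newId);
        results.add(new FineTuneRow(List.of(
            msgs.get(0),
            new ChatMessage("user", newUser),
            new ChatMessage("assistant", mapper.writeValueAsString(asst)))));
    }
    return results;
}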
⚠️ Pitfall

Training on production logs without filtering failed or escalated conversations teaches the model to reproduce bad agent behavior. Include only rows where human SME or post-hoc eval marked the response as correct.

📦 Real World

Mature teams maintain a fine-tune backlog fed by eval failures: when golden case X fails for 3 consecutive prompt versions, SME writes 5–10 corrected pairs and opens a dataset PR. Fine-tune jobs trigger only when backlog crosses 500 reviewed rows and prompt path plateaus—same loop as CI evaluation.

⚖️ Trade-off

Synthetic data scales fast but encodes generator biases. A model fine-tuned on GPT-4-generated labels may inherit GPT-4 failure modes your base mini model would otherwise avoid. Cap synthetic ratio (e.g. ≤30%) unless human-reviewed.
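The cap is easy to enforce in the dataset PR check. The sketch below assumes each row carries the source tag from the hygiene checklist plus a hypothetical boolean "reviewed" field marking human-accepted rows; only unreviewed synthetic rows count against the cap.

```python
def unreviewed_synthetic_ratio(rows: list[dict]) -> float:
    """Share of rows that are synthetic and not yet human-reviewed."""
    if not rows:
        return 0.0
    flagged = sum(
        1 for r in rows
        if r.get("source") == "synthetic" and not r.get("reviewed", False)
    )
    return flagged / len(rows)


def within_synthetic_cap(rows: list[dict], cap: float = 0.30) -> bool:
    """True when unreviewed synthetic rows stay at or below the cap."""
    return unreviewed_synthetic_ratio(rows) <= cap


dataset = (
    [{"source": "human"}] * 6
    + [{"source": "synthetic"}] * 3
    + [{"source": "synthetic", "reviewed": True}]
)
print(unreviewed_synthetic_ratio(dataset), within_synthetic_cap(dataset))  # 0.3 True
```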

OpenAI fine-tuning: upload, jobs, models, cost

OpenAI’s managed fine-tuning pipeline: validate JSONL → upload file → create job → poll until succeeded → deploy returned model ID through your gateway. Training is billed per token processed during the job—plan ~$8 per 1M training tokens for recent mini-model rates (verify current pricing before budgeting).

Supported models (typical)

Eligible base models change with provider releases. Common production choices:

Base model | Best for | Training cost order
gpt-4o-mini | Classification, extraction, routing | Low; default first FT experiment
gpt-4o | Complex format + reasoning style | Higher train + inference
gpt-3.5-turbo (legacy) | Existing FT continuity | Check deprecation notices

Check the OpenAI docs for the current list of supported models before automating the pipeline.

Job lifecycle

  1. validating_files — schema and token count checks
  2. queued — waiting for capacity
  3. running — training in progress (minutes to hours)
  4. succeeded — fine_tuned_model ID available
  5. failed — inspect error object; fix data, retry
  6. cancelled — user or timeout abort
OpenAI fine-tune API
from openai import OpenAI
import time

client = OpenAI()

# 1. Upload training file
with open("data/ft_support_intent.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")

# 2. Create job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="support-intent-v1",
    hyperparameters={"n_epochs": 3},  # start conservative; watch overfit on small sets
)

# 3. Poll until complete
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status, job.trained_tokens)

if job.status != "succeeded":
    raise RuntimeError(job.error)

fine_tuned_model = job.fine_tuned_model  # ft:gpt-4o-mini:org:support-intent-v1:abc123

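# 4. Smoke inference: reuse the system prompt from the training rows
resp = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "Double charge on ORD-9912"},
    ],
)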
print(resp.choices[0].message.content)
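
// Java sketch in the style of Spring AI / OpenAI SDK clients.
// Class and method names below are illustrative; map them to the client library you actually use.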
OpenAiApi api = new OpenAiApi(System.getenv("OPENAI_API_KEY"));

Resource trainingFile = new FileSystemResource("data/ft_support_intent.jsonl");
String fileId = api.uploadFile(trainingFile, "fine-tune").getId();

FineTuningJobRequest req = FineTuningJobRequest.builder()
    .trainingFile(fileId)
    .model("gpt-4o-mini-2024-07-18")
    .suffix("support-intent-v1")
    .hyperparameters(Map.of("n_epochs", 3))
    .build();

FineTuningJob job = api.createFineTuningJob(req);
while (!List.of("succeeded", "failed", "cancelled").contains(job.getStatus())) {
    Thread.sleep(30_000);
    job = api.retrieveFineTuningJob(job.getId());
}

String fineTunedModel = job.getFineTunedModel();
// Route via gateway: model_registry.put("support-intent", fineTunedModel);

Cost estimation

Estimate training tokens ≈ sum of all message tokens × epochs. Example: 800 rows × ~400 tokens/row × 3 epochs ≈ 960K training tokens → ~$7.68 at $8/1M. Add a 20% buffer for validation files and failed retries. Inference on a fine-tuned model is billed per token at the provider's rate for fine-tuned variants, which can be higher than the base model's rate. FT does not reduce the per-token inference price; it only improves task accuracy per call.
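The same estimate as a reusable function, for plugging in your own row counts and the provider's current price. The function and parameter names are illustrative, and the $8/1M figure is the planning number used in this guide rather than a quoted price.

```python
def estimate_training_cost(rows: int, avg_tokens_per_row: int, epochs: int,
                           usd_per_million_tokens: float,
                           buffer: float = 0.20) -> dict:
    """Training tokens = rows x tokens per row x epochs; budget adds a retry buffer."""
    training_tokens = rows * avg_tokens_per_row * epochs
    base_cost = training_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "training_tokens": training_tokens,
        "base_cost_usd": round(base_cost, 2),
        "budget_usd": round(base_cost * (1 + buffer), 2),
    }


print(estimate_training_cost(800, 400, 3, usd_per_million_tokens=8.0))
# {'training_tokens': 960000, 'base_cost_usd': 7.68, 'budget_usd': 9.22}
```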

💰 Cost

Training: ~$8/1M tokens (verify live pricing). A 1,000-row job often lands under $15 total—cheap compared to engineer time if eval proves lift. Hidden cost: running 5 failed experiments without eval gates burns budget and morale.

AWS Bedrock custom models

For teams standardized on Bedrock, custom model import and model customization jobs store artifacts in S3 and expose inference through Bedrock runtime with IAM boundaries. Flow parallels OpenAI: prepare JSONL/JSON → upload to S3 → create customization job → deploy custom model ARN in gateway routing table alongside foundation model IDs.

Bedrock customization (boto3 sketch)
import boto3

bedrock = boto3.client("bedrock")

# Exact API varies by base model family — consult Bedrock model customization docs
job = bedrock.create_model_customization_job(
    jobName="support-intent-v1",
    customModelName="support-intent-classifier",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    trainingDataConfig={
        "s3Uri": "s3://ml-datasets/ft/support-intent-v1.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://ml-models/custom/support-intent-v1/"
    },
    hyperParameters={"epochCount": "3", "batchSize": "8"},
)

# Poll bedrock.get_model_customization_job(jobIdentifier=job["jobArn"]) until status == "Completed"
# Register the resulting custom model ARN (outputModelArn in the job description) in llm-gateway bedrock routing
🔬 Under the Hood

OpenAI counts every token in every message across all epochs toward training bill. Shorter system prompts in training data reduce cost but must match production system prompt semantics—or you train a distribution shift into the adapter.

⚠️ Pitfall

Setting n_epochs too high on <200 rows overfits through memorization: perfect training loss, worse held-out eval. Start with 2–3 epochs; use validation loss curves if the provider exposes them; early-stop on eval regression.

💡 Pro Tip

Store fine_tuning.job.id, training file hash, dataset semver, and resulting model ID in your model registry YAML. When auditors ask “what data trained model X?” you answer from git, not Slack history.

Eval before & after: never ship without comparison

A fine-tuned model that looks better in spot checks but loses 4 points on refund intent recall is a production incident waiting to happen. Run the same eval suite against base and fine-tuned endpoints; promote only when aggregate and per-bucket metrics improve, or tie within noise with acceptable cost/latency trade-offs.

Reuse Track 5 infrastructure: golden JSONL, scorers, CI thresholds from evaluation explained, CI evaluation, and LLM-as-judge patterns. Fine-tune eval is not a separate discipline—it is another model variant in the same regression matrix as prompt v2 or RAG index v3.

Comparison protocol

  1. Freeze golden set — held-out rows never in training JSONL
  2. Run base model — record scores, latencies, token usage per case ID
  3. Run fine-tuned model — identical prompts, temperature, max_tokens
  4. Diff per bucket — intent, locale, risk tier; block if any bucket drops >5%
  5. Human review sample — 20 regressions where FT scored lower than base
  6. Sign-off artifact — JSON report committed or attached to deploy PR
Metric | Accept FT if | Block if
Exact match (classification) | +3% absolute vs base | Any safety bucket regression
JSON schema valid rate | +5% or base <95% and FT ≥98% | FT < base
LLM judge faithfulness | +0.05 mean (RAG tasks) | −0.03 on policy bucket
P95 latency | Within 10% of base | >25% slower without accuracy gain
Base vs fine-tuned eval comparison
"""scripts/compare_ft_models.py — run before any FT deploy."""
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class EvalRow:
    id: str
    input: str
    expected: dict
    tags: list[str]

@dataclass
class CaseResult:
    row_id: str
    model: str
    score: float
    output: str
    latency_ms: float

def load_golden(path: Path) -> list[EvalRow]:
    return [EvalRow(**json.loads(line)) for line in path.read_text().splitlines() if line.strip()]

def score_exact_json(output: str, expected: dict) -> float:
    try:
        parsed = json.loads(output)
        return 1.0 if parsed == expected else 0.0
    except json.JSONDecodeError:
        return 0.0

def run_suite(rows: list[EvalRow], model_id: str, complete_fn) -> list[CaseResult]:
    results = []
    for row in rows:
        out, latency = complete_fn(model_id, row.input)
        results.append(CaseResult(row.id, model_id, score_exact_json(out, row.expected), out, latency))
    return results

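def compare(base: list[CaseResult], ft: list[CaseResult], min_lift: float = 0.03) -> dict:
    by_id_base = {r.row_id: r for r in base}
    by_id_ft = {r.row_id: r for r in ft}
    # Both runs must cover the same non-empty case set; a gap is a harness bug, not a score
    mismatched = set(by_id_base) ^ set(by_id_ft)
    if mismatched or not by_id_base:
        raise ValueError(f"base and FT runs differ or are empty: {sorted(mismatched)[:5]}")
    lifts = [by_id_ft[i].score - by_id_base[i].score for i in by_id_base]
    mean_lift = sum(lifts) / len(lifts)
    regressions = [i for i in by_id_base if by_id_ft[i].score < by_id_base[i].score]
    return {
        "mean_lift": mean_lift,
        # Pass needs enough average lift and fewer than 15% of cases regressing
        "pass": mean_lift >= min_lift and len(regressions) / len(lifts) < 0.15,
        "regression_ids": regressions[:20],
    }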

# Usage:
# base_results = run_suite(golden, "gpt-4o-mini", gateway_complete)
# ft_results = run_suite(golden, "ft:gpt-4o-mini:org:...", gateway_complete)
# report = compare(base_results, ft_results)
# assert report["pass"], report
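
// Java equivalent (imports: java.util.*, java.util.stream.Collectors)
public record EvalRow(String id, String input, Map<String, Object> expected, List<String> tags) {}
public record CaseResult(String rowId, String model, double score, String output, long latencyMs) {}
public record CompareReport(double meanLift, boolean pass, List<String> regressionIds) {}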

public CompareReport compare(List<CaseResult> base, List<CaseResult> ft, double minLift) {
    Map<String, CaseResult> baseMap = base.stream()
        .collect(Collectors.toMap(CaseResult::rowId, r -> r));
    List<Double> lifts = new ArrayList<>();
    List<String> regressions = new ArrayList<>();
    for (CaseResult ftRow : ft) {
        CaseResult b = baseMap.get(ftRow.rowId());
        double lift = ftRow.score() - b.score();
        lifts.add(lift);
        if (lift < 0) regressions.add(ftRow.rowId());
    }
    double meanLift = lifts.stream().mapToDouble(d -> d).average().orElse(0);
    boolean pass = meanLift >= minLift && (double) regressions.size() / lifts.size() < 0.15;
    return new CompareReport(meanLift, pass, regressions.stream().limit(20).toList());
}
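The compare function above checks the aggregate lift but not step 4 of the comparison protocol, the per-bucket diff. A minimal sketch of that step follows; it assumes scores keyed by case ID and the tags from the golden rows as bucket names, and it reads "drops >5%" as an absolute drop of 0.05 in mean score, which is an interpretation rather than something the protocol pins down.

```python
from collections import defaultdict


def bucket_means(scores: dict[str, float], tags: dict[str, list[str]]) -> dict[str, float]:
    """Mean score per tag; a case counts toward every tag it carries."""
    grouped = defaultdict(list)
    for row_id, score in scores.items():
        for tag in tags.get(row_id, []):
            grouped[tag].append(score)
    return {tag: sum(vals) / len(vals) for tag, vals in grouped.items()}


def blocked_buckets(base_scores: dict[str, float], ft_scores: dict[str, float],
                    tags: dict[str, list[str]], max_drop: float = 0.05) -> dict[str, float]:
    """Return {bucket: lift} for every bucket where FT drops more than max_drop."""
    base_means = bucket_means(base_scores, tags)
    ft_means = bucket_means(ft_scores, tags)
    return {
        tag: ft_means.get(tag, 0.0) - base_mean
        for tag, base_mean in base_means.items()
        if base_mean - ft_means.get(tag, 0.0) > max_drop
    }


tags = {"a": ["refund"], "b": ["refund"], "c": ["shipping"], "d": ["shipping"]}
base = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 1.0}
ft = {"a": 1.0, "b": 0.0, "c": 1.0, "d": 1.0}
print(blocked_buckets(base, ft, tags))  # {'refund': -0.5}
```

In this example the mean lift across all four cases is zero, yet the refund bucket loses half its accuracy, which is exactly the failure an aggregate-only gate would miss.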
⚠️ Pitfall

Never ship without comparison. Teams promote FT after a manual vibe-check on 10 examples, when the base model with an improved prompt would have passed full CI. Wire FT promotion to the same GitHub Actions gate as prompt changes; see continuous evaluation in CI/CD.

🔒 Security

Include adversarial and jailbreak cases from safety guardrails in FT comparison. Fine-tuning can erode refusal behavior if training data lacks negative examples. Zero-tolerance: any safety regression blocks deploy.

🎯 Interview Tip

“How do you validate a fine-tuned model?” — Held-out golden set, base vs FT A/B, bucketed metrics, safety suite unchanged, human review on regressions, model ID pinned in registry with dataset hash. Mention CI gate, not offline accuracy alone.

📦 Real World

Shadow traffic: route 5% of prod requests to FT endpoint, log scores with online eval sampling from observability & monitoring. Promote to 100% only when offline lift confirms and shadow week shows no ticket escalation spike.
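Selecting the 5% can be done deterministically by hashing a request ID, so the same request always lands in the same group and the sample is reproducible in logs. This is a sketch under assumptions: the function name and the choice of SHA-256 over a request ID are illustrative, and the caller still decides whether the FT response is served or only logged.

```python
import hashlib


def in_shadow_sample(request_id: str, percent: float = 5.0) -> bool:
    """Deterministically place about `percent` of request IDs in the shadow group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return bucket < percent * 100                         # 5.0 -> buckets 0..499


sampled = sum(in_shadow_sample(f"req-{i}") for i in range(10_000))
print(sampled)  # close to 500, and identical on every run
```

Hashing avoids keeping per-request state and makes the split stable across gateway replicas, unlike a call to a random number generator.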

CI integration sketch

PR opens (dataset or FT job completes)
  → evals/compare_ft.yaml workflow
  → download golden held-out v2.1
  → run base + candidate ft model
  → upload compare.json artifact
  → PR comment: mean_lift + regression table
  → merge blocked if pass=false
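The last step of the workflow, blocking the merge, can be a small script that turns the compare.json artifact into an exit code. The sketch assumes the report has the shape returned by compare (mean_lift, pass, regression_ids); the gate function name and file location are illustrative.

```python
import json
from pathlib import Path


def gate(report_path: Path) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    if not report_path.exists():
        print(f"BLOCK: {report_path} not found; comparison did not run")
        return 1
    report = json.loads(report_path.read_text(encoding="utf-8"))
    regressions = report.get("regression_ids", [])
    print(f"mean_lift={report.get('mean_lift')} regressions={len(regressions)}")
    # Only an explicit pass=true unblocks; a missing field counts as a failure
    return 0 if report.get("pass") is True else 1


# In the CI job, the workflow step would end with something like:
#   sys.exit(gate(Path("compare.json")))
```

Treating a missing report as a failure matters: a skipped comparison should never look the same as a passed one.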

Production: adaptation checklist

Fine-tuned models are production assets: register IDs in the gateway, gate rollout with feature flags, log dataset lineage, and keep a rollback path to base model within one config change. This checklist closes Guide 3 before multimodal ingest pipelines.

Pre-deploy checklist

  • Decision doc — why FT vs prompt/RAG; link to path comparison eval
  • Dataset semver — training JSONL hash in model registry
  • Held-out eval — base vs FT report attached; all buckets green
  • Safety suite — no regression on adversarial golden
  • Gateway entry — model_alias: support-intent → ft:… in routing config
  • Feature flag — ft_support_intent_v1 default off; staged % rollout
  • Cost projection — inference tokens unchanged; training cost amortized in FinOps tag
  • Rollback runbook — flip flag to base model; no redeploy required
  • Retrain trigger — backlog threshold + eval plateau documented
  • On-call — ML owner paged on FT latency/error rate anomalies
Control | Implementation | Alert when
Model ID drift | Registry validates ft: prefix at deploy | Unknown model in request log
FT error rate | Compare 5xx vs base model route | FT >2× base for 15 min
Quality drift | Online eval sample 1% | Judge score −0.05 vs weekly baseline
Stale weights | Policy freshness in RAG; FT age in registry | FT >90 days without re-eval
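Two of the alert conditions in the table reduce to simple comparisons. The sketch below assumes the caller supplies request counts already aggregated over the 15-minute window and the date of the last re-eval from the registry; both function names are illustrative.

```python
from datetime import date


def ft_error_rate_alert(ft_5xx: int, ft_total: int,
                        base_5xx: int, base_total: int,
                        ratio: float = 2.0) -> bool:
    """Alert when the FT route's 5xx rate exceeds `ratio` times the base route's."""
    if ft_total == 0 or base_total == 0:
        return False  # not enough traffic in this window to compare
    return (ft_5xx / ft_total) > ratio * (base_5xx / base_total)


def stale_ft_alert(last_eval: date, today: date, max_age_days: int = 90) -> bool:
    """Alert when the model has gone more than max_age_days without re-eval."""
    return (today - last_eval).days > max_age_days


print(ft_error_rate_alert(30, 1000, 10, 1000))             # True: 3% vs 1%
print(stale_ft_alert(date(2025, 1, 1), date(2025, 4, 15)))  # True: 104 days
```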
Gateway model registry entry
# Registry contents shown as a Python dict (mirrors config/models.yaml)
MODELS = {
    "support-intent": {
        "default": "gpt-4o-mini-2024-07-18",
        "variants": {
            "ft_v1": {
                "model_id": "ft:gpt-4o-mini:org:support-intent-v1:abc123",
                "dataset": "ft-support-intent-v1.2.0",
                "dataset_sha256": "a1b2c3…",
                "eval_report": "reports/ft_v1_compare.json",
                "flag": "ft_support_intent_v1",
            }
        },
    }
}

def resolve_model(feature: str, flags: dict) -> str:
    cfg = MODELS[feature]
    # Return the first variant whose own flag is enabled; otherwise fall back to base
    for variant in cfg["variants"].values():
        if flags.get(variant["flag"]):
            return variant["model_id"]
    return cfg["default"]
@ConfigurationProperties("llm.models")
public class ModelRegistry {
    Map<String, FeatureModels> features;

    public String resolve(String feature, FeatureFlags flags) {
        var f = features.get(feature);
        if (flags.isEnabled("ft_support_intent_v1")) {
            return f.variants().get("ft_v1").modelId();
        }
        return f.defaultModel();
    }
}
📦 Production checklist

Guide 3 complete when: FT model is behind a flag, registry documents dataset lineage, CI blocks deploy without compare report, rollback tested in staging, and on-call knows which flag to disable. Continue to multimodal ingest for document pipelines.

⚖️ Trade-off

Maintaining both base and FT endpoints doubles model surface area for security reviews and cost attribution. Sunset base-only path only when FT survives two re-eval cycles—not after first successful week.

💰 Cost

Fine-tuning does not reduce inference token prices. Budget FT when it reduces downstream cost: fewer repair loops, smaller prompts, or replacing frontier with fine-tuned mini on the same quality bar.

🔬 Under the Hood

OpenAI fine-tuned models share rate limits with your org tier on that model family. Load-test the FT endpoint before moving the flag to 100%: cold starts are rare, but concurrency limits bite during flash traffic.

⚠️ Pitfall

Deploying FT for knowledge that should be RAG: model answers confidently after policy change until someone notices support tickets. Keep RAG for facts; FT for format and routing. Re-run decision matrix on every retrain request.