Fine-tuning & adaptation
Fine-tuning bakes behavior into weights; RAG and prompts change inputs at query time. Teams often fine-tune because it feels like “real ML,” then discover that refund-policy answers are still wrong because the facts live in PDFs that change weekly. Never fine-tune for knowledge alone: use RAG for facts, prompts for tone and format, and fine-tuning only when you have stable patterns, labeled data, and eval proof that prompt and RAG paths cannot reach your bar.
After reading, you should be able to: choose among prompt engineering, RAG, routing, and fine-tuning with a decision matrix; compare full FT, LoRA, QLoRA, and instruction tuning; prepare JSONL training data with quality gates; run OpenAI fine-tuning jobs and read lifecycle states; estimate training cost at ~$8/1M tokens; run before/after eval suites against base and fine-tuned models; wire Bedrock model customization jobs; and complete the Track 6 adaptation checklist before production rollout.
When to fine-tune: prompt vs RAG vs fine-tune
Adaptation changes inputs and orchestration at query time. Fine-tuning changes the model’s default behavior by updating weights. The wrong choice wastes GPU budget, creates stale knowledge, and blocks fast policy updates. Start with the cheapest path that can pass your eval suite—not the path that sounds most impressive.
Most production LLM features need three capabilities: correct facts, consistent format, and appropriate tone. These map to different adaptation layers. Facts that change weekly belong in a retrieval index, not in memorized weights. Stable JSON schemas and classification boundaries are fine-tuning sweet spots when prompt engineering plateaus on your golden set.
Four adaptation paths
| Path | What moves | Best when | Weak when |
|---|---|---|---|
| Prompt engineering | System instructions, few-shot examples, output schema | Tone, refusals, structured output, short policy summaries | Long factual corpora stuffed into context; brittle format at scale |
| RAG | Retrieved chunks injected at query time | Policies, docs, catalogs that change often; cite-able answers | Needle-in-haystack without good index; latency-sensitive micro-classifiers |
| Routing + tools | Model tier, agent loop, external APIs | Structured workflows (refunds, lookups, calculations) | Pure free-text FAQ where every answer is generative |
| Fine-tuning | Model weights (full or adapter) | Stable style, intent classification, extraction schemas, domain phrasing | Facts change weekly; insufficient labeled data; prompt+RAG already pass eval |
Decision matrix
Score each incoming use case against these signals. If policy or catalog freshness is critical, RAG wins regardless of team preference. If the task is “always return this JSON shape for support tickets,” try structured output + prompt first; fine-tune only when eval shows persistent format drift across hundreds of real examples.
| Signal | Prefer | Avoid fine-tune because |
|---|---|---|
| Answer must cite source documents | RAG + citation prompt | FT cannot guarantee retrieval; weights hallucinate confidently |
| Policy updates > once/month | RAG | Retrain lag; stale weights until next job completes |
| Fixed output schema (100+ fields) | Structured output → then FT if needed | FT helps only after native schema mode exhausted |
| Intent routing (5–20 labels) | Mini model + prompt → FT on mini | Frontier FT is expensive for simple classification |
| Brand voice stable 12+ months | Prompt → FT on style rows | Prompt alone may work; FT is optimization not unblocker |
| Multilingual phrasing in one domain | FT or adapter on open model | Prompts balloon per locale |
Never fine-tune for knowledge alone. If the model must know this quarter’s refund policy, product SKUs, or internal org chart, that information belongs in RAG or a database tool—not in weights. Fine-tuned “facts” go stale silently; users get confident wrong answers with no audit trail. RAG gives you chunk IDs to log; fine-tuning gives you regret.
Run the same golden eval through prompt-only, RAG, and (if justified) a fine-tuned endpoint before committing to FT. If prompt + RAG already meet your threshold, fine-tuning is a cost/latency optimization—not a project unblocker. Document the comparison in the PR that approves GPU spend.
Decision flowchart
flowchart TD
    A[New LLM feature request] --> B{Does answer require citing live documents?}
    B -->|Yes| C[Use RAG]
    B -->|No| D{Is behavior stable 12+ months?}
    D -->|No| E[Prompt + tools + routing]
    D -->|Yes| F{Prompt + structured output passes eval?}
    F -->|Yes| G[Ship prompt path, no FT]
    F -->|No| H{Have 500+ quality labeled examples?}
    H -->|No| I[Collect data, improve prompt/RAG first]
    H -->|Yes| J{Before/after eval shows FT lift?}
    J -->|No| K[Do not fine-tune]
    J -->|Yes| L[Fine-tune with CI eval gate]
    C --> M[Track chunk freshness SLA]
    L --> N[Register model ID in gateway + flags]
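The flowchart can also be read as a small pure function, which makes the decision easy to unit-test or attach to a decision doc. The sketch below is one possible encoding; the `UseCase` fields and the returned labels are hypothetical names chosen for illustration, and the 500-example threshold is simply the figure from the chart.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the flowchart questions (field names are illustrative)."""
    needs_live_citations: bool
    stable_12_months: bool
    prompt_passes_eval: bool
    labeled_examples: int
    ft_shows_lift: bool


def choose_path(uc: UseCase, min_examples: int = 500) -> str:
    """Walk the decision flowchart top to bottom and return a path label."""
    if uc.needs_live_citations:
        return "rag"
    if not uc.stable_12_months:
        return "prompt_tools_routing"
    if uc.prompt_passes_eval:
        return "prompt_only"
    if uc.labeled_examples < min_examples:
        return "collect_data"
    if not uc.ft_shows_lift:
        return "do_not_fine_tune"
    return "fine_tune_with_ci_gate"
```

Because the checks run in flowchart order, a use case that needs live citations returns `"rag"` no matter how much labeled data exists.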
Support teams often request fine-tuning after 50 “bad tone” examples. In practice, a revised system prompt + 8 curated few-shot pairs fixes 80% of tone issues in a day. Reserve fine-tuning for classification F1 stuck below target after prompt iteration, or extraction JSON that fails schema validation on >5% of production traffic despite response_format.
When asked “RAG or fine-tune?” answer with the freshness axis: RAG for mutable facts, fine-tune for stable behavior patterns. Mention eval gates—never ship FT without base vs fine-tuned comparison on the same golden set. Interviewers listen for “we fine-tuned the retriever embedding model” vs “we fine-tuned GPT-4 for policy”—different problems, different data.
Prompts and RAG update in minutes; fine-tuning updates in hours to days plus re-eval cycles. The operational cost of FT is not the training bill—it is the retrain cadence and model registry every time behavior must change. Prefer adaptation paths your on-call can roll back with a flag flip.
Prerequisite context: RAG explained, system prompts, and LLM gateway for routing fine-tuned model IDs alongside base models.
Approaches: full FT, LoRA, QLoRA, instruction tuning
“Fine-tuning” spans updating every weight in a 70B model to training a 16MB LoRA adapter on a laptop. Hosted APIs (OpenAI, Bedrock) hide the technique; self-hosted open models expose the full menu. Pick based on model access, GPU budget, and whether you need mergeable artifacts for air-gapped deploy.
Comparison table
| Approach | What updates | VRAM / cost | Typical use | Production notes |
|---|---|---|---|---|
| Full fine-tuning | All model parameters | Very high (multi-GPU, 70B+ impractical without cluster) | Domain shift on open models; research; small models (<7B) | Produces full checkpoint; hard to A/B vs base without duplicate serving |
| LoRA | Low-rank adapter matrices on attention/MLP layers | Moderate (well below full FT: only adapter weights carry gradients and optimizer state) | Instruction following, style, classification on 7B–70B | Swap adapters at runtime; multiple tasks on one base |
| QLoRA | LoRA adapters on a frozen 4-bit quantized base | Low (consumer GPU for 7B–13B) | Prototyping, internal tools, cost-sensitive open-model FT | Base is quantized during training; validate perplexity vs bf16 base before prod |
| Instruction tuning | SFT on (instruction, response) pairs—full or LoRA | Depends on base technique | Chat alignment, tool-call format, JSON extraction | OpenAI hosted FT is instruction-style on chat messages JSONL |
Hosted vs self-hosted
OpenAI fine-tuning is managed instruction tuning: you upload JSONL, they train, you receive ft:gpt-4o-mini:org:... model IDs. The public API exposes a few hyperparameters (epochs, batch size, learning-rate multiplier) but not the adapter technique or LoRA rank. AWS Bedrock offers model customization jobs (fine-tuning and continued pre-training on supported base models) with S3-stored output artifacts and provisioned throughput options, plus a separate Custom Model Import path for weights trained elsewhere. Self-hosted (Hugging Face PEFT, Axolotl, LLaMA-Factory) gives full control over LoRA/QLoRA hyperparameters and merge-to-single-checkpoint for vLLM or TGI serving.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base checkpoint is an example; pin the exact revision you train against
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,  # rank: higher = more capacity, more VRAM
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Train only ~0.1–1% of parameters; save the adapter dir, not the full base weights

// Self-hosted LoRA typically runs via Python training jobs.
// Java serving layer loads base + adapter metadata from model registry:
public record AdapterManifest(
    String baseModelId,   // e.g. meta-llama/Llama-3.1-8B
    String adapterPath,   // s3://models/support-intent-lora-v3
    int rank,
    String taskType
) {}
// Gateway resolves: tenant → adapter manifest → inference endpoint
LoRA injects trainable low-rank matrices ΔW = BA alongside frozen weights W. At inference, the effective weight is W + BA (or merged once for latency). QLoRA keeps W in 4-bit NF4 while the adapters stay in fp16/bf16, which is why a 24GB GPU can fine-tune 13B models that full FT could not fit.
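To make the W + BA arithmetic concrete, here is a minimal pure-Python sketch. It counts adapter parameters for one weight matrix and merges a tiny adapter into a base matrix. The function names are illustrative, and the sketch leaves out the `lora_alpha / r` scaling factor that libraries such as PEFT apply to the update.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """B is d_out x rank and A is rank x d_in, so the adapter has rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen d_out x d_in matrix it sits beside."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge(w: list[list[float]], b: list[list[float]], a: list[list[float]]) -> list[list[float]]:
    """Effective weight W + BA, as used when an adapter is merged for serving."""
    delta = matmul(b, a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


# A 4096 x 4096 projection with rank 16 trains well under 1% of that matrix.
fraction = trainable_fraction(4096, 4096, 16)

# Tiny rank-1 example: W is 2 x 2, B is 2 x 1, A is 1 x 2.
merged = merge([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[3.0, 4.0]])
```

For the 4096 × 4096 case the adapter holds 131,072 weights against 16,777,216 frozen ones, which is where the "train only a fraction of a percent" intuition comes from.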
Self-hosted QLoRA on a 7B model: roughly $2–10 in spot GPU for a prototype run. Full FT on 70B without adapters: thousands per experiment. Hosted OpenAI FT charges per training token (~$8/1M)—often cheaper than engineer + GPU time for small classification tasks on mini models if you already use the API in production.
A common failure is training a LoRA against one base model revision and then serving it on a newer base: the adapter no longer matches the weights it was trained with, and outputs turn to garbage. Pin the base model revision in your registry (meta-llama/Llama-3.1-8B@sha256:…) and block deploy if the base hash differs from the training manifest.
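A deploy-time guard for that pin might look like the sketch below. The `TrainingManifest` fields and the `BaseMismatchError` type are hypothetical; the only idea taken from the text is comparing the base revision recorded at training time with the one the serving stack actually loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingManifest:
    """What the adapter was trained against (illustrative fields)."""
    base_model_id: str
    base_revision: str
    adapter_path: str


class BaseMismatchError(RuntimeError):
    """Raised when an adapter would be served on a base it was not trained with."""


def assert_base_matches(manifest: TrainingManifest, serving_model_id: str, serving_revision: str) -> None:
    """Block deploy unless the serving base is exactly the training base."""
    trained = (manifest.base_model_id, manifest.base_revision)
    serving = (serving_model_id, serving_revision)
    if trained != serving:
        raise BaseMismatchError(
            f"adapter {manifest.adapter_path} was trained on {trained}, "
            f"but serving loaded {serving}"
        )
```

Calling this in the deploy pipeline, before the adapter is attached, turns a silent quality failure into a hard error.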
When each approach wins
- Hosted OpenAI FT — fast path for gpt-4o-mini classification/extraction; no GPU ops team
- Bedrock custom model — enterprise AWS footprint, IAM-bound artifacts, private VPC inference
- LoRA self-hosted — multiple tenant-specific adapters on one GPU pool; air-gapped deploy
- Full FT — rare in prod; consider only small specialist models (<3B) or embedding retrievers
Training data may contain PII and secrets scrubbed from prompts but present in historical tickets. Run redaction pipelines before JSONL export; restrict fine-tune dataset buckets to ML roles; never log raw training rows in CI artifacts. Bedrock and OpenAI retain data per their fine-tuning policies—read DPA before uploading customer content.
Data requirements: volume, format, quality
Fine-tuning quality is bounded by label quality, not row count alone. Fifty pristine examples can beat five thousand noisy auto-labeled pairs. Still, most instruction-tuning tasks need 50–100 examples minimum to show signal; 500–1,000 diverse rows is the practical bar for production confidence on classification and extraction.
Volume guidelines
| Task type | Minimum | Production target | Notes |
|---|---|---|---|
| Binary / 3-class intent | 50–100 per class | 500+ total, balanced | Watch minority class recall |
| JSON extraction (10 fields) | 100 examples | 800–1,000 | Cover optional/null field combos |
| Style / tone alignment | 50 high-quality pairs | 200–400 curated | Quality >> quantity; one bad row teaches bad habits |
| Multi-turn tool format | 80 trajectories | 500+ | Include negative tool-call abstentions |
Quality over quantity. One hour of SME review on 100 rows beats overnight LLM-generated synthetic bulk. Auto-label with a strong model, then human-accept/reject—never train on unreviewed synthetic data for compliance-facing tasks.
JSONL format (OpenAI chat fine-tuning)
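Each line is one JSON object with a messages array. Roles: system, user, assistant. An optional weight (0 or 1) on an assistant message controls whether that turn contributes to the training loss, which is useful for multi-turn rows where only some replies are gold. Preference tuning (DPO) uses a separate row format on supported pipelines. Validate format and token counts locally before upload.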
# data/ft_support_intent.jsonl: one JSON object per line
EXAMPLE_CLASSIFY = {
    "messages": [
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "My card was charged twice for order ORD-8821. Need refund ASAP."},
        {"role": "assistant", "content": "{\"intent\": \"billing_duplicate_charge\", \"priority\": \"high\"}"},
    ]
}
EXAMPLE_EXTRACT = {
    "messages": [
        {"role": "system", "content": "Extract refund fields as JSON. Use null for missing."},
        {"role": "user", "content": "Customer Jane Doe, order ORD-4420, wants partial refund $24.50 for damaged item."},
        {"role": "assistant", "content": "{\"order_id\": \"ORD-4420\", \"customer_name\": \"Jane Doe\", \"amount_usd\": 24.50, \"reason\": \"damaged_item\"}"},
    ]
}

import json
from pathlib import Path


def write_jsonl(rows: list[dict], path: Path) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
// Jackson writes one JSON object per line for OpenAI upload
record ChatMessage(String role, String content) {}
record FineTuneRow(List<ChatMessage> messages) {}

void writeJsonl(List<FineTuneRow> rows, Path path) throws IOException {
    var mapper = new ObjectMapper();
    try (var writer = Files.newBufferedWriter(path)) {
        for (FineTuneRow row : rows) {
            writer.write(mapper.writeValueAsString(row));
            writer.newLine();
        }
    }
}

// Example row: intent classification
var classify = new FineTuneRow(List.of(
    new ChatMessage("system", "Classify support tickets as JSON intent + priority."),
    new ChatMessage("user", "Charged twice on ORD-8821, need refund."),
    new ChatMessage("assistant", "{\"intent\":\"billing_duplicate_charge\",\"priority\":\"high\"}")
));
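Before uploading, it can help to run a local structural check over the JSONL. The sketch below is a team-side quality gate, not a reimplementation of any provider's validator. The rules it enforces are assumptions suited to simple chat rows like the examples above: a known role, non-empty string content, and an assistant message last. Rows with tool calls would need looser rules.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems for one training row (empty list means OK)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            errors.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i} has empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        errors.append("last message must be the assistant's gold response")
    return errors


def validate_jsonl_text(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; blank lines are skipped."""
    problems: dict[int, list[str]] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        errs = validate_row(row) if isinstance(row, dict) else ["line is not a JSON object"]
        if errs:
            problems[lineno] = errs
    return problems
```

An empty result from `validate_jsonl_text` means every line passed these local checks; the provider may still reject the file for reasons this sketch does not cover, such as token limits.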
Data hygiene checklist
- Deduplicate — near-duplicate tickets overweight frequent phrasing
- Stratify — hold out 15–20% validation split never seen in training
- Tag provenance — source: human|synthetic|prod_sample per row
- Version datasets — append-only git LFS or S3 with semver (ft-v1.3.0.jsonl)
- PII scrub — replace names/emails with placeholders consistent at train and eval time
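Two of the checklist items, deduplication and a stratified hold-out, are easy to get subtly wrong, so a small sketch may help. It assumes each row is a flat dict with hypothetical `text` and `label` keys. Its dedupe is exact match after whitespace and case normalization, which is weaker than true near-duplicate detection but catches the most common copies.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row for each normalized text; order is preserved."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        digest = hashlib.sha256(normalize(row["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


def stratified_split(rows: list[dict], holdout: float = 0.2, seed: int = 13) -> tuple[list[dict], list[dict]]:
    """Hold out roughly `holdout` of each label so minority classes reach validation."""
    by_label: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible across runs
    train, val = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_val = max(1, round(len(group) * holdout)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Run `dedupe` before `stratified_split`; otherwise a duplicated ticket can land on both sides of the split and inflate held-out scores.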
Augmentation (careful)
Safe augmentations: paraphrase user text while preserving gold label; swap entity values (order IDs, amounts) in extraction tasks; translate to secondary locale if prod is multilingual. Unsafe: LLM-generated assistant responses without review; mixing contradictory labels for the same input; augmenting policy answers that should come from RAG.
import copy
import json
import random
import re

ORDER_RE = re.compile(r"ORD-\d{4}")


def augment_order_ids(row: dict, n: int = 3) -> list[dict]:
    """Swap order IDs in user text; assistant JSON updated to match."""
    out = []
    base_user = row["messages"][1]["content"]
    base_asst = json.loads(row["messages"][2]["content"])
    for _ in range(n):
        new_id = f"ORD-{random.randint(1000, 9999)}"
        new_row = copy.deepcopy(row)
        # Replace only the first order ID so multi-order tickets stay consistent
        new_row["messages"][1]["content"] = ORDER_RE.sub(new_id, base_user, count=1)
        asst = copy.deepcopy(base_asst)
        asst["order_id"] = new_id
        new_row["messages"][2]["content"] = json.dumps(asst)
        new_row["metadata"] = {"augmentation": "entity_swap"}
        out.append(new_row)
    return out
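List<FineTuneRow> augmentOrderIds(FineTuneRow base, int n, Random rng) throws JsonProcessingException {
    var mapper = new ObjectMapper();
    var results = new ArrayList<FineTuneRow>();
    var system = base.messages().get(0);
    var userText = base.messages().get(1).content();
    var asstJson = base.messages().get(2).content();
    for (int i = 0; i < n; i++) {
        String newId = "ORD-" + (1000 + rng.nextInt(9000));
        String newUser = userText.replaceFirst("ORD-\\d{4}", newId);
        // Parse assistant JSON, swap order_id, re-serialize (same pattern as Python)
        ObjectNode asst = (ObjectNode) mapper.readTree(asstJson);
        asst.put("order_id", newId);
        results.add(new FineTuneRow(List.of(
            system,
            new ChatMessage("user", newUser),
            new ChatMessage("assistant", mapper.writeValueAsString(asst)))));
    }
    return results;
}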
Training on production logs without filtering failed or escalated conversations teaches the model to reproduce bad agent behavior. Include only rows where human SME or post-hoc eval marked the response as correct.
Mature teams maintain a fine-tune backlog fed by eval failures: when golden case X fails for 3 consecutive prompt versions, SME writes 5–10 corrected pairs and opens a dataset PR. Fine-tune jobs trigger only when backlog crosses 500 reviewed rows and prompt path plateaus—same loop as CI evaluation.
Synthetic data scales fast but encodes generator biases. A model fine-tuned on GPT-4-generated labels may inherit GPT-4 failure modes your base mini model would otherwise avoid. Cap synthetic ratio (e.g. ≤30%) unless human-reviewed.
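The synthetic cap is simple to enforce as a dataset gate. The sketch below assumes each row carries the `source` provenance tag from the hygiene checklist plus a hypothetical boolean `human_reviewed` field. It counts only unreviewed synthetic rows against the cap, which is one reading of "unless human-reviewed".

```python
def unreviewed_synthetic_ratio(rows: list[dict]) -> float:
    """Share of rows that are synthetic and have not been accepted by a reviewer."""
    if not rows:
        return 0.0
    unreviewed = sum(
        1 for row in rows
        if row.get("source") == "synthetic" and not row.get("human_reviewed", False)
    )
    return unreviewed / len(rows)


def check_synthetic_cap(rows: list[dict], cap: float = 0.30) -> tuple[bool, float]:
    """Return (within_cap, ratio) so CI can both gate and report the number."""
    ratio = unreviewed_synthetic_ratio(rows)
    return ratio <= cap, ratio
```

Returning the ratio alongside the verdict lets the dataset PR show how close a version is to the limit, not only whether it passed.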
OpenAI fine-tuning: upload, jobs, models, cost
OpenAI’s managed fine-tuning pipeline: validate JSONL → upload file → create job → poll until succeeded → deploy returned model ID through your gateway. Training is billed per token processed during the job—plan ~$8 per 1M training tokens for recent mini-model rates (verify current pricing before budgeting).
Supported models (typical)
Eligible base models change with provider releases. Common production choices:
| Base model | Best for | Training cost order |
|---|---|---|
| gpt-4o-mini | Classification, extraction, routing | Low — default first FT experiment |
| gpt-4o | Complex format + reasoning style | Higher train + inference |
| gpt-3.5-turbo (legacy) | Existing FT continuity | Check deprecation notices |
Check OpenAI docs for current supported_models before pipeline automation.
Job lifecycle
- validating_files — schema and token count checks
- queued — waiting for capacity
- running — training in progress (minutes to hours)
- succeeded — fine_tuned_model ID available
- failed — inspect error object; fix data, retry
- cancelled — user or timeout abort
from openai import OpenAI
import time

client = OpenAI()

# 1. Upload training file
with open("data/ft_support_intent.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")

# 2. Create job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="support-intent-v1",
    hyperparameters={"n_epochs": 3},  # start conservative; watch overfit on small sets
)

# 3. Poll until the job reaches a terminal state
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status, job.trained_tokens)

if job.status != "succeeded":
    raise RuntimeError(job.error)

fine_tuned_model = job.fine_tuned_model  # ft:gpt-4o-mini:org:support-intent-v1:abc123

# 4. Smoke inference: send the same system prompt used in the training rows
resp = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "Classify support tickets. Reply with JSON: {\"intent\": \"...\", \"priority\": \"...\"}"},
        {"role": "user", "content": "Double charge on ORD-9912"},
    ],
)
print(resp.choices[0].message.content)
// Java sketch of the same flow. The client type and method names below are
// illustrative placeholders, not a verified Spring AI or OpenAI Java SDK API;
// map them to whatever client library you actually use.
OpenAiApi api = new OpenAiApi(System.getenv("OPENAI_API_KEY"));
Resource trainingFile = new FileSystemResource("data/ft_support_intent.jsonl");
String fileId = api.uploadFile(trainingFile, "fine-tune").getId();

FineTuningJobRequest req = FineTuningJobRequest.builder()
    .trainingFile(fileId)
    .model("gpt-4o-mini-2024-07-18")
    .suffix("support-intent-v1")
    .hyperparameters(Map.of("n_epochs", 3))
    .build();
FineTuningJob job = api.createFineTuningJob(req);

while (!List.of("succeeded", "failed", "cancelled").contains(job.getStatus())) {
    Thread.sleep(30_000);
    job = api.retrieveFineTuningJob(job.getId());
}
if (!"succeeded".equals(job.getStatus())) {
    throw new IllegalStateException("Fine-tuning job ended as " + job.getStatus());
}
String fineTunedModel = job.getFineTunedModel();
// Route via gateway: model_registry.put("support-intent", fineTunedModel);
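The polling loops above run until a terminal state with no upper bound. A pipeline usually wants a timeout as well, and a loop it can test without calling the API. The sketch below takes the status lookup as an injected callable; `retrieve` is a hypothetical wrapper that returns the job's status string (for example, a thin function around the provider client), and the default interval and timeout are placeholders.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    retrieve: Callable[[str], str],
    job_id: str,
    *,
    interval_s: float = 30.0,
    timeout_s: float = 6 * 3600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = retrieve(job_id)
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the function deterministic in unit tests; production callers can rely on the defaults.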
Cost estimation
Estimate training tokens ≈ sum of all message tokens × epochs. Example: 800 rows × ~400 tokens/row × 3 epochs ≈ 960K training tokens → ~$7.68 at $8/1M. Add 20% buffer for validation files and failed retries. Fine-tuning does not lower the per-token inference rate; depending on the provider, fine-tuned models are billed at the base tier's rate or higher, so check the price sheet. What FT buys is better task accuracy per call.
Training: ~$8/1M tokens (verify live pricing). A 1,000-row job often lands under $15 total—cheap compared to engineer time if eval proves lift. Hidden cost: running 5 failed experiments without eval gates burns budget and morale.
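The estimate above is plain arithmetic, so it can live in a small helper next to the training script. The sketch below reuses the planning rate from the text ($8 per 1M training tokens) and the 20% buffer as defaults; both are assumptions to replace with current provider pricing.

```python
def estimate_training_cost(
    rows: int,
    avg_tokens_per_row: int,
    epochs: int,
    usd_per_million: float = 8.0,  # planning figure from the text; verify live pricing
    buffer: float = 0.20,          # headroom for validation files and retries
) -> dict:
    """Training tokens are roughly rows x tokens per row x epochs."""
    tokens = rows * avg_tokens_per_row * epochs
    base_usd = tokens / 1_000_000 * usd_per_million
    return {
        "training_tokens": tokens,
        "base_usd": round(base_usd, 2),
        "budget_usd": round(base_usd * (1 + buffer), 2),
    }


# Worked example from the text: 800 rows, ~400 tokens each, 3 epochs.
estimate = estimate_training_cost(rows=800, avg_tokens_per_row=400, epochs=3)
```

For the worked example this gives 960,000 training tokens, $7.68 before the buffer, and about $9.22 with it.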
AWS Bedrock custom models
For teams standardized on Bedrock, model customization jobs read training data from S3, write output artifacts to S3, and expose inference through Bedrock runtime with IAM boundaries. (Custom Model Import is a separate path for weights trained outside Bedrock.) Flow parallels OpenAI: prepare JSONL → upload to S3 → create customization job → deploy custom model ARN in gateway routing table alongside foundation model IDs.
import boto3

bedrock = boto3.client("bedrock")

# Supported base models and hyperparameter keys vary by model family;
# consult the Bedrock model customization docs before running
job = bedrock.create_model_customization_job(
    jobName="support-intent-v1",
    customModelName="support-intent-classifier",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={
        "s3Uri": "s3://ml-datasets/ft/support-intent-v1.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://ml-models/custom/support-intent-v1/"
    },
    hyperParameters={"epochCount": "3", "batchSize": "8"},
)
# Poll get_model_customization_job(jobIdentifier=job["jobArn"]) until status is "Completed"
# Register the resulting custom model ARN in llm-gateway bedrock routing
OpenAI counts every token in every message across all epochs toward training bill. Shorter system prompts in training data reduce cost but must match production system prompt semantics—or you train a distribution shift into the adapter.
Setting n_epochs too high on <200 rows overfits memorization: perfect training loss, worse held-out eval. Start with 2–3 epochs; use validation loss curves if provider exposes them; early-stop on eval regression.
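When per-epoch validation losses are available (from a self-hosted run, or from whatever metrics the provider exposes), choosing the epoch count can follow a simple patience rule. The sketch below is a generic early-stopping selector, not any provider's feature; `patience` and `min_delta` are illustrative parameters.

```python
def early_stop_epoch(val_losses: list[float], patience: int = 1, min_delta: float = 0.0) -> int:
    """Return the 1-based epoch with the best validation loss seen before patience ran out."""
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_idx = 0
    bad_epochs = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx] - min_delta:
            best_idx = i        # new best: reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1     # no improvement this epoch
            if bad_epochs >= patience:
                break
    return best_idx + 1
```

With losses of 0.90, 0.60, 0.55, 0.61, 0.70 and the default patience, the function returns 3: the fourth epoch fails to improve on the third, so the third is kept and later epochs are not considered.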
Store fine_tuning.job.id, training file hash, dataset semver, and resulting model ID in your model registry YAML. When auditors ask “what data trained model X?” you answer from git, not Slack history.
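A lineage record like that can be produced mechanically when the job finishes. The sketch below hashes the training file bytes with SHA-256 and serializes the record as JSON, because the standard library has no YAML writer; the field names are hypothetical and the structure carries over to YAML directly.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One registry entry linking a model ID to the data that trained it."""
    job_id: str
    base_model: str
    fine_tuned_model: str
    dataset_version: str
    dataset_sha256: str


def build_record(job_id: str, base_model: str, fine_tuned_model: str,
                 dataset_version: str, dataset_bytes: bytes) -> LineageRecord:
    """Hash the exact bytes that were uploaded so the entry is verifiable later."""
    digest = hashlib.sha256(dataset_bytes).hexdigest()
    return LineageRecord(job_id, base_model, fine_tuned_model, dataset_version, digest)


def to_registry_json(record: LineageRecord) -> str:
    """Stable key order keeps registry diffs small in code review."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Hashing the uploaded bytes rather than a re-export guards against a dataset file being regenerated with different contents under the same version tag.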
Eval before & after: never ship without comparison
A fine-tuned model that beats your intuition but loses 4 points on refund intent recall is a production incident waiting to happen. Run the same eval suite against base and fine-tuned endpoints; promote only when aggregate and per-bucket metrics improve or tie within noise with acceptable cost/latency trade-offs.
Reuse Track 5 infrastructure: golden JSONL, scorers, CI thresholds from evaluation explained, CI evaluation, and LLM-as-judge patterns. Fine-tune eval is not a separate discipline—it is another model variant in the same regression matrix as prompt v2 or RAG index v3.
Comparison protocol
- Freeze golden set — held-out rows never in training JSONL
- Run base model — record scores, latencies, token usage per case ID
- Run fine-tuned model — identical prompts, temperature, max_tokens
- Diff per bucket — intent, locale, risk tier; block if any bucket drops >5%
- Human review sample — 20 regressions where FT scored lower than base
- Sign-off artifact — JSON report committed or attached to deploy PR
| Metric | Accept FT if | Block if |
|---|---|---|
| Exact match (classification) | +3% absolute vs base | Any safety bucket regression |
| JSON schema valid rate | +5% or base <95% and FT ≥98% | FT < base |
| LLM judge faithfulness | +0.05 mean (RAG tasks) | −0.03 on policy bucket |
| P95 latency | Within 10% of base | >25% slower without accuracy gain |
"""scripts/compare_ft_models.py: run before any FT deploy."""
from dataclasses import dataclass
from pathlib import Path
import json


@dataclass
class EvalRow:
    id: str
    input: str
    expected: dict
    tags: list[str]


@dataclass
class CaseResult:
    row_id: str
    model: str
    score: float
    output: str
    latency_ms: float


def load_golden(path: Path) -> list[EvalRow]:
    return [EvalRow(**json.loads(line)) for line in path.read_text().splitlines() if line.strip()]


def score_exact_json(output: str, expected: dict) -> float:
    """1.0 only when the output parses as JSON and equals the expected object."""
    try:
        parsed = json.loads(output)
        return 1.0 if parsed == expected else 0.0
    except json.JSONDecodeError:
        return 0.0


def run_suite(rows: list[EvalRow], model_id: str, complete_fn) -> list[CaseResult]:
    results = []
    for row in rows:
        out, latency = complete_fn(model_id, row.input)
        results.append(CaseResult(row.id, model_id, score_exact_json(out, row.expected), out, latency))
    return results


def compare(base: list[CaseResult], ft: list[CaseResult], min_lift: float = 0.03) -> dict:
    by_id_base = {r.row_id: r for r in base}
    by_id_ft = {r.row_id: r for r in ft}
    # Both runs must cover the same golden cases, or the lift is meaningless
    missing = sorted(set(by_id_base) - set(by_id_ft))
    if not by_id_base or missing:
        raise ValueError(f"base and FT runs must cover the same non-empty case set; missing: {missing[:5]}")
    lifts = [by_id_ft[i].score - by_id_base[i].score for i in by_id_base]
    mean_lift = sum(lifts) / len(lifts)
    regressions = [i for i in by_id_base if by_id_ft[i].score < by_id_base[i].score]
    return {
        "mean_lift": mean_lift,
        "pass": mean_lift >= min_lift and len(regressions) / len(lifts) < 0.15,
        "regression_ids": regressions[:20],
    }


# Usage:
# base_results = run_suite(golden, "gpt-4o-mini", gateway_complete)
# ft_results = run_suite(golden, "ft:gpt-4o-mini:org:...", gateway_complete)
# report = compare(base_results, ft_results)
# assert report["pass"], report
public record EvalRow(String id, String input, Map<String, Object> expected, List<String> tags) {}
public record CaseResult(String rowId, String model, double score, String output, long latencyMs) {}
public record CompareReport(double meanLift, boolean pass, List<String> regressionIds) {}

public CompareReport compare(List<CaseResult> base, List<CaseResult> ft, double minLift) {
    Map<String, CaseResult> baseMap = base.stream()
        .collect(Collectors.toMap(CaseResult::rowId, r -> r));
    List<Double> lifts = new ArrayList<>();
    List<String> regressions = new ArrayList<>();
    for (CaseResult ftRow : ft) {
        CaseResult b = baseMap.get(ftRow.rowId());
        if (b == null) {
            // Case missing from the base run: fail loudly instead of a NullPointerException
            throw new IllegalArgumentException("No base result for case " + ftRow.rowId());
        }
        double lift = ftRow.score() - b.score();
        lifts.add(lift);
        if (lift < 0) regressions.add(ftRow.rowId());
    }
    if (lifts.isEmpty()) {
        throw new IllegalArgumentException("No cases to compare");
    }
    double meanLift = lifts.stream().mapToDouble(d -> d).average().orElse(0);
    boolean pass = meanLift >= minLift && (double) regressions.size() / lifts.size() < 0.15;
    return new CompareReport(meanLift, pass, regressions.stream().limit(20).toList());
}
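The comparison protocol also calls for a per-bucket diff, which an aggregate mean lift can hide. The sketch below groups scores by tag and reports buckets whose mean dropped by more than a threshold. It takes plain `row_id -> score` dicts and a `row_id -> tags` dict as inputs (hypothetical shapes), and it treats "drops >5%" as an absolute drop of 0.05 in mean score, which is an assumption.

```python
from collections import defaultdict


def bucket_means(scores: dict[str, float], tags_by_id: dict[str, list[str]]) -> dict[str, float]:
    """Mean score per tag; a row with several tags counts toward each of them."""
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0])
    for row_id, score in scores.items():
        for tag in tags_by_id.get(row_id, []):
            totals[tag][0] += score
            totals[tag][1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}


def blocked_buckets(
    base_scores: dict[str, float],
    ft_scores: dict[str, float],
    tags_by_id: dict[str, list[str]],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return {bucket: ft_mean - base_mean} for buckets that regressed past max_drop."""
    base = bucket_means(base_scores, tags_by_id)
    ft = bucket_means(ft_scores, tags_by_id)
    return {
        tag: ft.get(tag, 0.0) - base_mean
        for tag, base_mean in base.items()
        if base_mean - ft.get(tag, 0.0) > max_drop
    }
```

A non-empty result should block promotion even when the overall mean lift is positive, since a gain in one bucket can mask a loss in another.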
Never ship without comparison. Teams promote FT after a manual vibe-check on 10 examples when the base model with an improved prompt would have passed full CI. Wire FT promotion to the same GitHub Actions gate as prompt changes; see continuous evaluation in CI/CD.
Include adversarial and jailbreak cases from safety guardrails in FT comparison. Fine-tuning can erode refusal behavior if training data lacks negative examples. Zero-tolerance: any safety regression blocks deploy.
“How do you validate a fine-tuned model?” — Held-out golden set, base vs FT A/B, bucketed metrics, safety suite unchanged, human review on regressions, model ID pinned in registry with dataset hash. Mention CI gate, not offline accuracy alone.
Shadow traffic: route 5% of prod requests to FT endpoint, log scores with online eval sampling from observability & monitoring. Promote to 100% only when offline lift confirms and shadow week shows no ticket escalation spike.
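Routing a fixed percentage of traffic is usually done with a deterministic hash rather than a random draw, so the same request or user always lands on the same side. The sketch below is one common way to do that; the `salt` value and function name are illustrative, not part of any gateway API.

```python
import hashlib


def in_rollout(request_key: str, percent: float, salt: str = "ft_support_intent_v1") -> bool:
    """Deterministically place a key in one of 10,000 buckets and admit the first percent of them."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Using a per-experiment salt keeps the 5% group for this rollout independent of the groups chosen for other flags, and raising `percent` only ever adds keys to the group.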
CI integration sketch
PR opens (dataset or FT job completes)
→ evals/compare_ft.yaml workflow
→ download golden held-out v2.1
→ run base + candidate ft model
→ upload compare.json artifact
→ PR comment: mean_lift + regression table
→ merge blocked if pass=false
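The last step of that workflow, blocking the merge, comes down to turning the report into a process exit code. The sketch below assumes the report carries the `pass`, `mean_lift`, and `regression_ids` keys produced by the compare script; the file name and exit-code convention are illustrative.

```python
import json
import sys


def gate(report: dict) -> tuple[int, str]:
    """Map a compare report to (exit_code, message): 0 pass, 1 block, 2 malformed."""
    if not isinstance(report.get("pass"), bool) or "mean_lift" not in report:
        return 2, "malformed report: expected 'pass' (bool) and 'mean_lift'"
    regressions = report.get("regression_ids", [])
    summary = f"mean_lift={report['mean_lift']:+.3f}, regressions={len(regressions)}"
    if report["pass"]:
        return 0, f"PASS ({summary})"
    return 1, f"BLOCK ({summary})"


def main(argv: list[str]) -> int:
    """Read compare.json from argv[1], print the verdict, and return the exit code."""
    if len(argv) != 2:
        print("usage: gate.py compare.json", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as f:
        code, message = gate(json.load(f))
    print(message)
    return code

# In CI, the script entry point would call: sys.exit(main(sys.argv))
```

Treating a malformed report as its own failure code keeps a broken eval job from being mistaken for a passing one.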
Production: adaptation checklist
Fine-tuned models are production assets: register IDs in the gateway, gate rollout with feature flags, log dataset lineage, and keep a rollback path to base model within one config change. This checklist closes Guide 3 before multimodal ingest pipelines.
Pre-deploy checklist
- Decision doc — why FT vs prompt/RAG; link to path comparison eval
- Dataset semver — training JSONL hash in model registry
- Held-out eval — base vs FT report attached; all buckets green
- Safety suite — no regression on adversarial golden
- Gateway entry — model_alias: support-intent → ft:… in routing config
- Feature flag — ft_support_intent_v1 default off; staged % rollout
- Cost projection — inference tokens unchanged; training cost amortized in FinOps tag
- Rollback runbook — flip flag to base model; no redeploy required
- Retrain trigger — backlog threshold + eval plateau documented
- On-call — ML owner paged on FT latency/error rate anomalies
| Control | Implementation | Alert when |
|---|---|---|
| Model ID drift | Registry validates ft: prefix at deploy | Unknown model in request log |
| FT error rate | Compare 5xx vs base model route | FT >2× base for 15 min |
| Quality drift | Online eval sample 1% | Judge score −0.05 vs weekly baseline |
| Stale weights | Policy freshness in RAG; FT age in registry | FT >90 days without re-eval |
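The alert conditions in the table can be expressed as small predicates that a monitoring job evaluates once per window. The sketch below mirrors three of the rows; the function names are hypothetical, and windowing (for example the 15-minute duration) is left to the caller.

```python
from datetime import date


def ft_error_rate_alert(ft_errors: int, ft_total: int, base_errors: int, base_total: int,
                        ratio: float = 2.0) -> bool:
    """Alert when the FT route's error rate exceeds ratio x the base route's rate."""
    if ft_total == 0 or base_total == 0:
        return False  # not enough traffic in this window to compare
    return ft_errors / ft_total > ratio * (base_errors / base_total)


def quality_drift_alert(current_score: float, weekly_baseline: float, max_drop: float = 0.05) -> bool:
    """Alert when the sampled judge score falls at least max_drop below baseline."""
    return weekly_baseline - current_score >= max_drop


def is_stale(last_eval: date, today: date, max_age_days: int = 90) -> bool:
    """Alert when the FT model has gone longer than max_age_days without re-eval."""
    return (today - last_eval).days > max_age_days
```

Note that with a base error rate of zero, any FT error trips the first check; a minimum-traffic or minimum-error floor may be worth adding for low-volume routes.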
# config/models.py (shown as a Python dict; the same structure maps directly to YAML)
MODELS = {
    "support-intent": {
        "default": "gpt-4o-mini-2024-07-18",
        "variants": {
            "ft_v1": {
                "model_id": "ft:gpt-4o-mini:org:support-intent-v1:abc123",
                "dataset": "ft-support-intent-v1.2.0",
                "dataset_sha256": "a1b2c3…",
                "eval_report": "reports/ft_v1_compare.json",
                "flag": "ft_support_intent_v1",
            }
        },
    }
}


def resolve_model(feature: str, flags: dict) -> str:
    cfg = MODELS[feature]
    # Use the first variant whose own flag is on; otherwise fall back to base
    for variant in cfg["variants"].values():
        if flags.get(variant["flag"]):
            return variant["model_id"]
    return cfg["default"]
@ConfigurationProperties("llm.models")
public class ModelRegistry {
    Map<String, FeatureModels> features;

    public String resolve(String feature, FeatureFlags flags) {
        var f = features.get(feature);
        if (flags.isEnabled("ft_support_intent_v1")) {
            return f.variants().get("ft_v1").modelId();
        }
        return f.defaultModel();
    }
}
Guide 3 complete when: FT model is behind a flag, registry documents dataset lineage, CI blocks deploy without compare report, rollback tested in staging, and on-call knows which flag to disable. Continue to multimodal ingest for document pipelines.
Maintaining both base and FT endpoints doubles model surface area for security reviews and cost attribution. Sunset base-only path only when FT survives two re-eval cycles—not after first successful week.
Fine-tuning does not reduce inference token prices. Budget FT when it reduces downstream cost: fewer repair loops, smaller prompts, or replacing frontier with fine-tuned mini on the same quality bar.
OpenAI fine-tuned models share rate limits with your org tier on that model family. Load-test FT endpoint before 100% flag— cold start is rare but concurrency limits bite during flash traffic.
A recurring failure is deploying FT for knowledge that should live in RAG: the model keeps answering confidently after a policy change until support tickets expose it. Keep RAG for facts; FT for format and routing. Re-run the decision matrix on every retrain request.