sharpbyte.dev
AI ecosystem · Fine-tuning

Fine-tuning, from adapters to RL

Eleven topics from the same beginner-friendly deck as our LLM and Prompt engineering tracks—why full fine-tuning breaks at LLM scale, the LoRA family and QLoRA/DoRA, building datasets for instruction tuning, SFT vs reward-driven RFT (GRPO), and how OpenEnv + ART standardize RL-style training for agents.

Topic 1

What is fine-tuning?

Adapt pretrained weights on new data—classic motivation before LLM-scale limits.

Read topic →
Topic 2

Issues with traditional fine-tuning

Billions of parameters, GPU RAM, multi-tenant storage—why full fine-tuning doesn’t scale for LLMs.

Read topic →
Topic 3

Five LLM fine-tuning techniques (and bonuses)

LoRA family, LoRA-drop, QLoRA, DoRA—parameter-efficient paths when full updates are prohibitive.

Read topic →
Topic 4

Implementing LoRA from scratch

ΔW, frozen W, and why a naïve ΔW with the same shape as W defeats the point: the setup for the low-rank trick.

Read topic →
Topic 5

How LoRA works

Factor ΔW ≈ AB with rank r, train only A and B, and interpret the result as an expressivity-vs-rank trade-off grid.

Read topic →
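
As a rough illustration of the low-rank idea in Topics 4 and 5, the sketch below counts trainable parameters for a full ΔW versus the factors A and B. It assumes W is d × k, A is d × r, and B is r × k (matching the ΔW ≈ AB ordering above); the function names and the 4096 × 4096 example size are hypothetical, chosen only for illustration.

```python
def lora_param_counts(d, k, r):
    """Return (full, low_rank) trainable-parameter counts for a d x k weight."""
    full = d * k            # training a dense delta-W of the same shape as W
    low_rank = r * (d + k)  # training A (d x r) and B (r x k) instead
    return full, low_rank


def matmul(a, b):
    """Plain-Python matrix product, just enough to show the shapes involved."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Assumed example size: one square 4096 x 4096 layer with rank 8.
full, low_rank = lora_param_counts(d=4096, k=4096, r=8)
print(full, low_rank, full // low_rank)  # 16777216 65536 256

# A tiny rank-1 update: A is 3 x 1, B is 1 x 2, so delta_w is 3 x 2.
A = [[1.0], [2.0], [3.0]]
B = [[0.5, -1.0]]
delta_w = matmul(A, B)
print(delta_w)  # [[0.5, -1.0], [1.0, -2.0], [1.5, -3.0]]
```

Raising r grows the adapter linearly in d + k, which is the expressivity-versus-size dial the topic describes.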
Topic 6

PyTorch LoRA implementation

LoRAWeights, alpha scaling, attaching adapters to fc layers, and training only A/B.

Read topic →
Topic 7

Generate your own IFT dataset

Synthetic instruction–response pairs with Distilabel: multi-LLM pipelines, judges, and seed data.

Read topic →
Topic 8

SFT vs RFT

Static labeled pairs vs online rewards—and a decision tree for when each fits.

Read topic →
Topic 9

Build a reasoning LLM with GRPO

Unsloth + TRL: LoRA, math data, deterministic reward hooks, and GRPOTrainer in action.

Read topic →
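
A deterministic reward hook, as mentioned in Topic 9, can be as simple as a pure function that scores each completion against a known answer. The sketch below is a standalone illustration and not the Unsloth or TRL API: the name `exact_match_reward`, its signature, and the "last number in the text" extraction rule are all assumptions made for this example.

```python
import re

# Matches integers and simple decimals, optionally negative.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def exact_match_reward(completions, answers):
    """Score 1.0 when the last number in a completion equals the reference answer.

    Hypothetical helper: the same inputs always give the same rewards,
    which is what makes the hook deterministic.
    """
    rewards = []
    for text, answer in zip(completions, answers):
        found = NUMBER.findall(text)
        predicted = found[-1] if found else None
        correct = predicted is not None and float(predicted) == float(answer)
        rewards.append(1.0 if correct else 0.0)
    return rewards


rewards = exact_match_reward(
    ["2 + 2 = 4", "I think it is 5", "no idea"],
    ["4", "4", "4"],
)
print(rewards)  # [1.0, 0.0, 0.0]
```

A real training setup would adapt the argument names and batching to whatever the trainer expects; only the scoring logic is the point here.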
Topic 10

OpenEnv for RL environments

Containerized Gymnasium-style services—reset, step, state—so agents and envs compose cleanly.

Read topic →
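
To give a feel for the reset, step, state shape named in Topic 10, here is a toy in-process environment. It is only a sketch: it is not containerized and does not use OpenEnv's actual classes; `CounterEnv`, `StepResult`, and the counting task are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: int
    reward: float
    done: bool


class CounterEnv:
    """Toy task: reach `target` exactly by adding 1 or 2 per step."""

    def __init__(self, target=3):
        self.target = target
        self.total = 0
        self.steps = 0

    def reset(self):
        """Start a fresh episode and return the initial observation."""
        self.total = 0
        self.steps = 0
        return StepResult(observation=0, reward=0.0, done=False)

    def step(self, action):
        """Apply one action and report observation, reward, and termination."""
        if action not in (1, 2):
            raise ValueError("action must be 1 or 2")
        self.total += action
        self.steps += 1
        done = self.total >= self.target
        reward = 1.0 if self.total == self.target else 0.0
        return StepResult(observation=self.total, reward=reward, done=done)

    def state(self):
        """Expose episode bookkeeping separately from observations."""
        return {"total": self.total, "steps": self.steps}


env = CounterEnv(target=3)
env.reset()
first = env.step(1)
second = env.step(2)
print(first, second, env.state())
```

Keeping the agent on one side of these three calls and the environment on the other is what lets the two be developed and swapped independently.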
Topic 11

Agent Reinforcement Trainer (ART)

OpenPipe’s client/server loop for agent trajectories, rewards, and GRPO-style updates.

Read topic →
How to read this track: skim how LLMs are trained first; Prompt engineering covers training-free steering, while this path updates weights efficiently.