Static labeled pairs vs. online rewards, and a decision tree for when each fits.
Two fine-tuning regimes
This produces a dataset on which the LLM can be easily fine-tuned.
So far, we’ve explored how to fine-tune a model efficiently using LoRA and its variants, and what kind of data instruction fine-tuning typically uses.
The next question is: how do different fine-tuning objectives actually change the learning process?
Broadly, fine-tuning falls into two categories.
Supervised Fine-Tuning (SFT)
Reinforcement Fine-Tuning (RFT)
Both can update the model through LoRA or similar PEFT methods, but their goals and training signals differ: SFT imitates labeled completions, while RFT optimizes a reward.
SFT vs RFT processes
Before diving deeper, it helps to review how LLMs are usually fine-tuned with supervised fine-tuning (SFT).
SFT process:
SFT: static prompt–completion dataset and deployment.
It starts with a static labeled dataset of prompt–completion pairs.
Training adjusts the model weights so that the model's outputs match these completions.
The best model (LoRA checkpoint) is then deployed for inference.
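The weight-update step in SFT is commonly driven by a next-token cross-entropy loss, that is, the negative log-likelihood of the reference completion. The sketch below computes that loss in plain Python for a single example. The probabilities are made up for illustration, the function name `sft_loss` is hypothetical, and masking out prompt tokens is a common convention rather than something this lesson prescribes.

```python
import math

def sft_loss(token_log_probs, completion_mask):
    """Mean negative log-likelihood over completion tokens only.

    token_log_probs: log-probability the model assigns to each reference token.
    completion_mask: 1 for completion tokens, 0 for prompt tokens (assumed convention).
    """
    picked = [-lp for lp, m in zip(token_log_probs, completion_mask) if m]
    if not picked:
        raise ValueError("no completion tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical probabilities assigned to the four reference tokens.
probs = [0.9, 0.8, 0.5, 0.25]
log_probs = [math.log(p) for p in probs]
mask = [0, 0, 1, 1]  # first two tokens are the prompt, last two the completion

loss = sft_loss(log_probs, mask)
print(f"SFT loss on completion tokens: {loss:.4f}")
```

Lowering this loss means raising the probability of the labeled completion, which is why SFT pulls the model toward the answers in the static dataset.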
RFT process:
RFT: sample, score with a reward, update online.
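RFT uses an online, reward-driven approach, so no static labels are required.
The model explores different outputs, and a reward function scores each one, for example by checking its correctness.
Over time, a policy-optimization algorithm such as GRPO (Group Relative Policy Optimization) updates the model to generate higher-reward answers.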
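The sample-and-score loop can be sketched in a few lines. The example below assumes a hypothetical verifiable reward (exact match on the last word of the answer) and shows the group-relative normalization that gives GRPO its name: each sampled output is compared against the mean and standard deviation of its own group. The sample strings and helper names are illustrative, and the actual policy-gradient update is omitted.

```python
import statistics

def reward_fn(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last word equals the expected answer."""
    tokens = completion.split()
    return 1.0 if tokens and tokens[-1] == expected else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four made-up samples for the same prompt, whose correct answer is "12".
samples = ["The answer is 12", "The answer is 14", "I think it is 12", "No idea"]
rewards = [reward_fn(s, "12") for s in samples]
advantages = group_relative_advantages(rewards)

for s, r, a in zip(samples, rewards, advantages):
    print(f"{s!r}: reward={r}, advantage={a:+.3f}")
```

Outputs with a positive advantage would be made more likely by the update and those with a negative advantage less likely, which is how the model drifts toward higher-reward answers without any labeled completions.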
SFT learns from static data and can end up memorizing the reference answers. RFT, being online, learns from rewards and can explore new strategies.
Contrast memorization vs exploration.
Choosing a strategy
This flowchart gives a quick guide on which fine-tuning method to use based on your data and the nature of the task.
Start by checking whether you have labeled (ground-truth) data.
If you don’t, the next question is whether the task is verifiable, that is, whether correctness can be checked automatically.
If it is not verifiable, use RLHF (reinforcement learning from human feedback), since humans must provide the preference signals.
If it is verifiable, RFT works because a reward function can check correctness automatically.
If you do have labeled data, the choice depends on how much you have:
Large datasets → use SFT.
Tiny datasets → ask whether step-by-step reasoning (such as chain-of-thought, CoT) helps.
If yes → RFT
If no → SFT
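The branches above can also be written as a small function, which makes the order of the questions explicit. This is only a restatement of the flowchart: the function and parameter names are hypothetical, and what counts as a "large" dataset is left to the caller because the flowchart does not give a threshold.

```python
def choose_strategy(has_labels: bool,
                    verifiable: bool = False,
                    large_dataset: bool = False,
                    reasoning_helps: bool = False) -> str:
    """Return "SFT", "RFT", or "RLHF" following the decision tree."""
    if not has_labels:
        # No ground truth: use an automatic reward if one exists, else human preferences.
        return "RFT" if verifiable else "RLHF"
    if large_dataset:
        return "SFT"
    # Tiny labeled dataset: RFT only if step-by-step reasoning helps the task.
    return "RFT" if reasoning_helps else "SFT"

print(choose_strategy(has_labels=False, verifiable=True))       # RFT
print(choose_strategy(has_labels=False, verifiable=False))      # RLHF
print(choose_strategy(has_labels=True, large_dataset=True))     # SFT
print(choose_strategy(has_labels=True, reasoning_helps=True))   # RFT
```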
Overall, this decision tree is a quick first pass for choosing a fine-tuning strategy that fits your data and task.
Now that we understand when to use SFT or RFT, let's apply RFT in practice.
For reasoning-heavy tasks like math or logic, GRPO (Group Relative Policy Optimization) is a widely used RFT method. Let’s walk through how to fine-tune a model using GRPO with Unsloth.