SFT vs RFT — Fine-tuning

Two fine-tuning regimes

This produces a dataset on which the LLM can be easily fine-tuned.

So far, we’ve explored how to fine-tune a model efficiently using LoRA and its variants, and what kind of data is typically used through instruction fine-tuning.

The next question is: how do different fine-tuning objectives actually change the learning process?

Broadly, fine-tuning falls into two categories.

Supervised Fine-Tuning (SFT)
Reinforcement Fine-Tuning (RFT)

Both can update the model using LoRA or similar PEFT methods, but their goals and training signals differ dramatically.

SFT vs RFT processes

Before diving deeper, it helps to understand how we usually fine-tune LLMs with supervised fine-tuning (SFT).

SFT process:

SFT: static prompt–completion dataset and deployment.

It starts with a static labeled dataset of prompt–completion pairs.
Training then adjusts the model weights to match these completions.
The best model (LoRA checkpoint) is then deployed for inference.
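To make "match these completions" concrete, here is a minimal sketch of the usual SFT objective: the mean negative log-likelihood of the reference completion tokens. The function name `sft_loss`, the toy probabilities, and the choice to mask out prompt tokens are assumptions made for illustration, not something the deck specifies.

```python
import math

def sft_loss(token_log_probs, completion_mask):
    """Mean negative log-likelihood over completion tokens only.

    token_log_probs: log-probability the model assigns to each reference token.
    completion_mask: 1 for completion tokens, 0 for prompt tokens (assumed masking).
    """
    picked = [lp for lp, m in zip(token_log_probs, completion_mask) if m]
    if not picked:
        raise ValueError("no completion tokens to train on")
    return -sum(picked) / len(picked)

# Hypothetical probabilities assigned to the reference tokens of one example.
probs = [0.9, 0.8, 0.5, 0.25]
log_probs = [math.log(p) for p in probs]
mask = [0, 0, 1, 1]  # the first two tokens belong to the prompt

loss = sft_loss(log_probs, mask)
print(round(loss, 4))
```

Lowering this loss pushes the model toward reproducing the labeled completions, which is why SFT depends on the quality and coverage of the static dataset.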

RFT process:

RFT: sample, score with a reward, update online.

RFT uses an online “reward” approach, so no static labels are required.
The model explores different outputs, and a reward function scores their correctness.
Over time, the model learns to generate higher-reward answers using a policy-optimization algorithm such as GRPO.
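The sample-and-score loop above can be sketched in a few lines. The example below assumes an exact-match reward and the group-relative normalization commonly described for GRPO (each reward minus the group mean, divided by the group standard deviation). The names `reward_exact_match` and `group_relative_advantages`, the sample strings, and the use of the population standard deviation are illustrative assumptions.

```python
import statistics

def reward_exact_match(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the answer matches, else 0.0."""
    return 1.0 if completion.strip() == expected.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical samples for the prompt "What is 2 + 2?"
samples = ["4", "5", " 4 ", "22"]
rewards = [reward_exact_match(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)

for s, r, a in zip(samples, rewards, advantages):
    print(repr(s), r, round(a, 3))
```

Samples with above-average reward get a positive advantage and are reinforced, while below-average samples are discouraged. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.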

SFT learns from static data and can end up memorizing the reference answers. RFT, being online, learns from rewards and can explore new strategies.

Choosing a strategy

This flowchart gives a quick guide on which fine-tuning method to use based on your data and the nature of the task.

Start by checking whether you have labelled (ground-truth) data.
If you don’t, the next question is whether the task is verifiable.
If it is not verifiable, use RLHF, since humans must provide preference signals.
If it is verifiable, use RFT, because correctness can be checked automatically.
If you do have labelled data, the choice depends on how much you have.
With a large dataset, use SFT.
With a tiny dataset, ask whether reasoning (like CoT) helps: if yes, use RFT; if no, use SFT.
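The same decision tree can be written as a small function. This is only a sketch of the flowchart's logic: the function name `choose_strategy` and its boolean parameters are hypothetical, and what counts as a "large" dataset is left to the caller because the deck gives no threshold.

```python
def choose_strategy(has_labels: bool,
                    verifiable: bool = False,
                    large_dataset: bool = False,
                    reasoning_helps: bool = False) -> str:
    """Return "SFT", "RFT", or "RLHF" following the flowchart's branches."""
    if not has_labels:
        # No ground truth: RFT needs an automatic check, otherwise humans judge.
        return "RFT" if verifiable else "RLHF"
    if large_dataset:
        return "SFT"
    # Tiny labelled dataset: RFT only if reasoning (e.g. CoT) helps.
    return "RFT" if reasoning_helps else "SFT"

print(choose_strategy(has_labels=False, verifiable=True))       # RFT
print(choose_strategy(has_labels=False, verifiable=False))      # RLHF
print(choose_strategy(has_labels=True, large_dataset=True))     # SFT
print(choose_strategy(has_labels=True, reasoning_helps=True))   # RFT
```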

Overall, this decision tree helps you quickly identify the most efficient and reliable fine-tuning strategy for your use case.

Decision tree: labels, verifiability, dataset size, reasoning.

Now that we understand when to use SFT or RFT, let's apply RFT in practice.

For reasoning-heavy tasks like math or logic, GRPO (Group Relative Policy Optimization) is one of the most effective RFT methods available. Let’s walk through how to fine-tune a model using GRPO with Unsloth.

Key takeaways

SFT fits abundant clean (prompt, completion) data; RFT fits verifiable rewards without dense labels.
Pick based on label availability, verifiability, dataset size, and whether reasoning helps, following the flowchart above.