sharpbyte.dev
← Fine-tuning
Fine-tuning · topic 9 of 11

Build a reasoning LLM with GRPO

Unsloth + TRL: LoRA, math data, deterministic reward hooks, and GRPOTrainer in action.

GRPO overview

Build a Reasoning LLM using GRPO [Hands On].

Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.

Here's a brief overview of GRPO:

GRPO loop: sample, reward, policy update.
GRPO loop: sample, reward, policy update.
  • Start with a dataset and add a reasoning-focused system prompt (e.g., “Think step by step…”).
  • The LLM generates multiple candidate responses using a sampling engine.
  • Each response is assigned rewards, which are aggregated to produce a score for every generated response.
  • A GRPO loss function uses these rewards to calculate gradients; backpropagation updates the LLM, and the model improves its reasoning ability over time.

Let’s dive into the code to see how we can use GRPO to turn any model into a reasoning powerhouse without any labeled data or human intervention.

We’ll use UnslothAI for efficient fine-tuning and HuggingFace TRL to apply GRPO.

The code is available here: Build a reasoning LLM from scratch using GRPO. You can run it without any installations by reproducing our environment below:

Reproducible environment / notebook snapshot.
Reproducible environment / notebook snapshot.

Let’s begin!

#1) Load the model

We start by loading Qwen3-4B-Base and its tokenizer using Unsloth. You can use any other open-weight LLM here.

Load base model + tokenizer with Unsloth helpers.
Load base model + tokenizer with Unsloth helpers.

#2) Define LoRA config

We'll use LoRA to avoid fine-tuning the entire model weights. In this code, we use Unsloth's PEFT by specifying:

  • The model
  • LoRA low-rank (r)
  • Modules for fine-tuning, etc.
LoRA r, targets, and PEFT configuration.
LoRA r, targets, and PEFT configuration.

#3) Create the dataset

We load the Open R1 Math dataset (a math problem dataset) and format it for reasoning.

Each sample includes:

  • A system prompt enforcing structured reasoning
  • A question from the dataset
  • The answer in the required format
Dataset rows formatted for GRPO training.
Dataset rows formatted for GRPO training.

#4) Define reward functions

In GRPO, we use deterministic functions to validate the response and assign a reward. No manual labelling required!

The reward functions:

Reward helpers overview.
Reward helpers overview.
  • Match format exactly
  • Match format approximately
  • Check the answer
  • Check numbers

#5) Use GRPO and start training

Now that we have the dataset and reward functions ready, it's time to apply GRPO. HuggingFace TRL provides everything we described in the GRPO diagram, out of the box, in the form of the GRPOConfig and GRPOTrainer.

GRPOTrainer wiring: config, model, rewards.
GRPOTrainer wiring: config, model, rewards.

Comparison

Again, we can see how GRPO turned a base model into a reasoning powerhouse.

Before/after comparison on reasoning tasks.
Before/after comparison on reasoning tasks.

RFT methods like GRPO work best when paired with reliable reinforcement learning environments. This brings us to an important component of RL-based fine-tuning: how agents interact with environments.

Key takeaways

  • GRPO scores multiple samples per prompt; gradients push the policy toward higher relative reward.
  • Deterministic reward checks replace manual labels for structured math-style outputs.