Unsloth + TRL: LoRA, math data, deterministic reward hooks, and GRPOTrainer in action.
Build a Reasoning LLM using GRPO [Hands On].
Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.
Here's a brief overview of GRPO:
Let’s dive into the code to see how we can use GRPO to turn any model into a reasoning powerhouse without any labeled data or human intervention.
We’ll use UnslothAI for efficient fine-tuning and HuggingFace TRL to apply GRPO.
The code is available here: Build a reasoning LLM from scratch using GRPO. You can run it without any installations by reproducing our environment below:
Let’s begin!
#1) Load the model
We start by loading Qwen3-4B-Base and its tokenizer using Unsloth. You can use any other open-weight LLM here.
#2) Define LoRA config
We'll use LoRA to avoid fine-tuning the entire model weights. In this code, we use Unsloth's PEFT by specifying:
#3) Create the dataset
We load the Open R1 Math dataset (a math problem dataset) and format it for reasoning.
Each sample includes:
#4) Define reward functions
In GRPO, we use deterministic functions to validate the response and assign a reward. No manual labelling required!
The reward functions:
#5) Use GRPO and start training
Now that we have the dataset and reward functions ready, it's time to apply GRPO. HuggingFace TRL provides everything we described in the GRPO diagram, out of the box, in the form of the GRPOConfig and GRPOTrainer.
Comparison
Again, we can see how GRPO turned a base model into a reasoning powerhouse.
RFT methods like GRPO work best when paired with reliable reinforcement learning environments. This brings us to an important component of RL-based fine-tuning: how agents interact with environments.