
Generate your own IFT dataset

Synthetic instruction–response pairs with Distilabel: multi-LLM pipelines, judges, and seed data.

Why instruction data


A pre-trained LLM has only learned to predict the next token, so when prompted it simply continues the text as if it were part of a book or an article.

For instance, given a question as a prompt, a pre-trained LLM may continue with more questions in the same style instead of answering it.

Fine-tuning the model on a synthetic dataset generated with existing LLMs can improve this.

The synthetic data consists of fabricated examples of human-AI interactions: an instruction paired with the response the model should give.

Check this sample:
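As a rough illustration, one such record could be stored and rendered into a single training string as sketched below. The field names ("instruction", "response"), the "### Instruction / ### Response" template, and the example text are all assumptions made for this sketch; real datasets and chat templates vary.

```python
# Hypothetical record layout; the field names are assumptions, not a standard.
record = {
    "instruction": "Summarize the water cycle in one sentence.",
    "response": (
        "Water evaporates, condenses into clouds, and returns to the "
        "surface as precipitation."
    ),
}


def render_prompt(rec: dict) -> str:
    """Join an instruction and its response into one training string."""
    return (
        "### Instruction:\n"
        f"{rec['instruction']}\n\n"
        "### Response:\n"
        f"{rec['response']}"
    )


training_text = render_prompt(record)
print(training_text)
```

During fine-tuning, the model sees many strings of this shape and learns to produce the response part when given only the instruction part.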

This process is called instruction fine-tuning (IFT), and it is described in the animation below:

Distilabel is an open-source framework that facilitates generating domain-specific synthetic text data using LLMs.

Check this to understand the underlying process:

And you get the synthetic dataset!

Next, let's look at the code.

First, we start with some standard imports:

Next, we load the Llama-3 models locally with Ollama:

Moving on, we define our pipeline:

First, we load the dataset (we’ll pass it shortly).
Next, we generate two responses.
Once done, we combine the responses into one column (under the hood, a prompt template is also created for the third LLM).
Moving on, we evaluate the responses with an LLM.
Finally, we define and run the pipeline.
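The flow of these steps can be sketched in plain Python. This is not Distilabel's API: generator_a, generator_b, judge, and run_pipeline are hypothetical stand-ins that return canned values, so the sketch shows only how data moves between the steps, not how real LLM calls are made.

```python
# Stand-ins for two generator LLMs; both are hypothetical and return canned text.
def generator_a(instruction):
    return f"Short answer to: {instruction}"


def generator_b(instruction):
    return f"Detailed, step-by-step answer to: {instruction}"


def judge(instruction, responses):
    """Toy judge: rates each response 1-5 by length. A real judge is an LLM."""
    return [min(5, 1 + len(r) // 20) for r in responses]


def run_pipeline(seed_rows, generators, judge_fn):
    results = []
    for row in seed_rows:                                    # load the dataset
        instruction = row["instruction"]
        generations = [g(instruction) for g in generators]   # generate responses
        ratings = judge_fn(instruction, generations)         # evaluate them
        results.append({                                     # combined in one row
            "instruction": instruction,
            "generations": generations,
            "ratings": ratings,
        })
    return results


seed_rows = [{"instruction": "What is overfitting?"}]
rows = run_pipeline(seed_rows, [generator_a, generator_b], judge)
print(rows[0])
```

Keeping all responses for an instruction in one row is what lets a single judge call compare them side by side.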

Once the pipeline has been defined, we execute it by passing it a seed dataset.

The seed dataset helps it generate new but similar samples. So we execute the pipeline with our seed dataset as follows:
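A seed dataset can be as small as a list of instructions. The sketch below assumes each row is a dictionary with a single "instruction" key and uses a hypothetical helper, prepare_seeds, to clean the list before it is handed to a pipeline; the helper and the example instructions are illustrative, not part of any library.

```python
def prepare_seeds(raw_instructions):
    """Normalize whitespace, drop empty strings and case-insensitive duplicates."""
    seen = set()
    seeds = []
    for text in raw_instructions:
        cleaned = " ".join(text.split())
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            seeds.append({"instruction": cleaned})
    return seeds


seed_dataset = prepare_seeds([
    "Explain gradient descent to a beginner.",
    "  explain gradient descent to a beginner. ",
    "",
    "Write a haiku about autumn.",
])
print(seed_dataset)
```

Removing duplicates up front avoids paying for several generations and a judge call on the same instruction.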

Done!

This produces the desired synthetic dataset of instruction-response pairs.
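One way to turn judged rows into final pairs is to keep the top-rated response per instruction and drop the row when even the best one scores too low. The row layout, the select_best helper, and the threshold of 4 are assumptions made for this sketch; the ratings are hand-written examples, not model output.

```python
def select_best(judged_rows, min_rating=4):
    """Keep the top-rated response per instruction if it meets the threshold."""
    pairs = []
    for row in judged_rows:
        ratings = row["ratings"]
        if not ratings:
            continue  # nothing was rated for this instruction
        best_idx = max(range(len(ratings)), key=lambda i: ratings[i])
        if ratings[best_idx] >= min_rating:
            pairs.append({
                "instruction": row["instruction"],
                "response": row["generations"][best_idx],
            })
    return pairs


# Hand-written example rows in a hypothetical layout.
judged_rows = [
    {
        "instruction": "Define recall.",
        "generations": ["Recall is TP / (TP + FN).", "It is a metric."],
        "ratings": [5, 2],
    },
    {
        "instruction": "Define precision.",
        "generations": ["Not sure.", "A number."],
        "ratings": [1, 2],
    },
]

dataset = select_best(judged_rows)
print(dataset)
```

The threshold trades dataset size for quality: a higher value keeps fewer pairs but filters out more weak responses.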

Check the sample below:

That was simple, wasn’t it?

IFT teaches format and behavior; synthetic pipelines scale pairs beyond hand labeling.
A judge model plus multiple generators is a common quality filter pattern.