sharpbyte.dev

Generate your own IFT dataset

Synthetic instruction–response pairs with Distilabel: multi-LLM pipelines, judges, and seed data.

Why instruction data


A pre-trained LLM that has not yet been fine-tuned simply continues the prompt, as if it were part of one long text from a book or an article.

For instance, when prompted with a question, a pre-trained LLM may continue with further questions or related text instead of answering it.

Fine-tuning the model on a synthetic dataset generated by existing LLMs can correct this behavior.

Using LLMs to generate synthetic fine-tuning data (deck).

The synthetic data consists of fabricated examples of human-AI interactions, each pairing an instruction with a response.

Check this sample:

Sample human–AI interaction from the deck.
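
As a rough illustration of what one such record might look like, here is a minimal sketch. The field names (`instruction`, `response`) and the `### Instruction:` / `### Response:` template are assumptions made for this example, not something the deck prescribes; in practice the template should match whatever format the base model expects.

```python
# Hypothetical record layout for one synthetic human-AI interaction.
record = {
    "instruction": "Explain what instruction fine-tuning is in one sentence.",
    "response": (
        "Instruction fine-tuning trains a pre-trained LLM on "
        "instruction-response pairs so it answers prompts instead of "
        "merely continuing them."
    ),
}


def to_training_text(row: dict) -> str:
    """Render a record with an illustrative (assumed) prompt template."""
    return (
        f"### Instruction:\n{row['instruction']}\n\n"
        f"### Response:\n{row['response']}"
    )


training_text = to_training_text(record)
print(training_text)
```

A fine-tuning dataset is then just many such records, rendered with one consistent template.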

This process is called instruction fine-tuning (IFT), and the animation below illustrates it:

Instruction fine-tuning process animation from the deck.

Distilabel is an open-source framework that facilitates generating domain-specific synthetic text data using LLMs.

The underlying process looks like this:

Distilabel overview for synthetic data.
  • Input an instruction.
  • Two LLMs generate responses.
  • A judge LLM rates the responses.
  • The best response is paired with the instruction.

And you get the synthetic dataset!
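
The four steps above can be sketched in plain Python. This is not Distilabel's API: `build_pair`, the two stand-in generators, and the word-count "judge" are hypothetical placeholders chosen so the example runs without any model. A real judge would be an LLM that returns a rating per response.

```python
from typing import Callable, Dict, List

Generator = Callable[[str], str]                  # instruction -> response
Judge = Callable[[str, List[str]], List[float]]   # one rating per response


def build_pair(
    instruction: str, generators: List[Generator], judge: Judge
) -> Dict[str, object]:
    """Generate candidate responses, rate them, keep the top-rated one."""
    responses = [generate(instruction) for generate in generators]
    ratings = judge(instruction, responses)
    best_index = max(range(len(responses)), key=lambda i: ratings[i])
    return {
        "instruction": instruction,
        "response": responses[best_index],
        "ratings": ratings,
    }


# Stand-in "models" (assumptions) so the sketch runs without an LLM.
def terse_model(instruction: str) -> str:
    return "Paris."


def verbose_model(instruction: str) -> str:
    return "The capital of France is Paris."


def length_judge(instruction: str, responses: List[str]) -> List[float]:
    # Toy scoring rule: more words means a higher rating.
    return [float(len(r.split())) for r in responses]


pair = build_pair(
    "What is the capital of France?",
    [terse_model, verbose_model],
    length_judge,
)
print(pair)
```

Keeping the generators and the judge as plain callables makes it easy to swap the stubs for real model calls later.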

Next, let's look at the code.

First, we start with some standard imports:

Starter imports for the Distilabel pipeline.

Next, we load the Llama-3 models locally with Ollama:

Ollama loads Llama-3 for local generation.
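
For context, a locally running Ollama server can also be queried directly over its HTTP API. The sketch below is not the Distilabel code from the deck. It assumes Ollama is serving on its default port (11434) and that a model tagged `llama3` has already been pulled; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Default local Ollama endpoint (assumption: no custom host or port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generation request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Inspect the payload without contacting the server.
request = build_request("Write a haiku about fine-tuning.")
print(request.data.decode("utf-8"))
# With Ollama running locally: print(generate("Write a haiku about fine-tuning."))
```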

Moving on, we define our pipeline:

Pipeline blocks wired for generate → judge.
  • First, we load the dataset (we’ll pass it shortly).
  • Next, we generate two responses.
  • Once done, we combine the responses into one column (under the hood, a prompt template is also created for the third LLM, which acts as the judge).
  • Moving on, we evaluate the responses with an LLM.
  • Finally, we define and run the pipeline.
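
To make the "combine into one column" step concrete, here is a plain-Python sketch of what such a step does conceptually. The column names (`generation`, `generations`, `model_name`, `model_names`), the helper names, and the wording of the judge prompt are assumptions for illustration; the template Distilabel builds internally may differ.

```python
from typing import Dict, List


def combine_columns(rows: List[Dict[str, str]]) -> Dict[str, object]:
    """Merge per-model rows into one row with list-valued columns.

    Assumes every row answers the same instruction.
    """
    return {
        "instruction": rows[0]["instruction"],
        "generations": [row["generation"] for row in rows],
        "model_names": [row["model_name"] for row in rows],
    }


def judge_prompt(row: Dict[str, object]) -> str:
    """Render an illustrative rating prompt for the judge LLM."""
    numbered = "\n".join(
        f"<{i}> {text}" for i, text in enumerate(row["generations"], start=1)
    )
    return (
        "Rate each response to the instruction from 1 (poor) to 5 (excellent).\n"
        f"Instruction: {row['instruction']}\n"
        f"Responses:\n{numbered}\n"
        "Ratings:"
    )


# Two hypothetical generator outputs for the same instruction.
per_model_rows = [
    {"instruction": "Define overfitting.", "generation": "Memorizing noise.", "model_name": "model-a"},
    {"instruction": "Define overfitting.", "generation": "Fitting training data too closely.", "model_name": "model-b"},
]

combined = combine_columns(per_model_rows)
prompt = judge_prompt(combined)
print(prompt)
```

Grouping the candidates into one row lets the judge see all of them in a single prompt and rate them against each other.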

Once the pipeline has been defined, we need to execute it by giving it a seed dataset.

The seed dataset provides the starting instructions from which the pipeline generates new samples. We execute the pipeline with our seed dataset as follows:

Run pipeline on seed data; inspect output sample.
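
As a sketch of the shape of this step, the snippet below maps a seed dataset through a pair-building function and serializes the result as JSON Lines. Everything here is hypothetical scaffolding: `run_on_seed`, `to_jsonl`, and the `echo_pair` stand-in are not Distilabel calls, and the seed rows are made up for the example.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_on_seed(
    seed_rows: Iterable[Dict[str, str]],
    make_pair: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Apply the pair-building step to every seed instruction."""
    return [make_pair(row["instruction"]) for row in seed_rows]


def to_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize rows as JSON Lines, one record per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Made-up seed instructions (assumption).
seed_dataset = [
    {"instruction": "Define overfitting."},
    {"instruction": "What is a learning rate?"},
]


def echo_pair(instruction: str) -> Dict[str, str]:
    # Stand-in for the real generate -> judge -> select step.
    return {"instruction": instruction, "response": f"(model answer to: {instruction})"}


dataset = run_on_seed(seed_dataset, echo_pair)
jsonl = to_jsonl(dataset)
print(jsonl.splitlines()[0])
```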

Done!

This produces the synthetic instruction-response dataset as desired, with each instruction paired with the response the judge rated best.

That was simple, wasn’t it?

Key takeaways

  • IFT teaches format and behavior; synthetic pipelines scale pairs beyond hand labeling.
  • A judge model plus multiple generators is a common quality filter pattern.