How to train an LLM from scratch

After the architecture exists, training turns random weights into a helpful assistant—in clear stages from pre-training through alignment and reasoning.

Why training happens in stages

Building the Transformer stack is only half the story. The model still needs to learn language and behave well in real products—answering questions, following instructions, staying safe, and reasoning when the task allows it.

Labs rarely do all of that in one pass. Instead they chain stages, each with a different goal and dataset:

Pre-training — absorb language and world knowledge from huge text corpora.
Instruction fine-tuning — learn to follow prompts and reply in a helpful format.
Preference fine-tuning — align with human taste when there is no single “correct” answer.
Reasoning fine-tuning — sharpen math and logic using verifiable right/wrong signals.

Overview: random init → pre-train → instruct → preference → reasoning.

Stage 0: Randomly initialized LLM

Day zero is a model with random weights. It has never seen text, so it knows nothing about language.

Ask it “What is an LLM?” and you might get nonsense like “try peter hand and hello 448Sn”: arbitrary tokens strung together with no grammar and no meaning.

This stage matters as a mental anchor: every capability you see later was added by training, not wired in at birth.

Before any data: outputs are random token soup.
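To make the "token soup" concrete, here is a toy sketch. It assumes a hypothetical eight-word vocabulary and uses small Gaussian noise as a stand-in for an untrained network's output scores; it is not a real Transformer, only an illustration of why nearly uniform probabilities produce meaningless samples.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["what", "is", "an", "LLM", "try", "peter", "hand", "hello"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def untrained_logits(rng):
    # Assumption: small random noise stands in for an untrained network.
    # The prompt is ignored entirely, which is the point of the illustration.
    return [rng.gauss(0.0, 0.02) for _ in VOCAB]

def sample_tokens(n, seed=0):
    """Sample n tokens, one at a time, from the near-uniform distribution."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        probs = softmax(untrained_logits(rng))
        out.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return out

probs = softmax(untrained_logits(random.Random(1)))
tokens = sample_tokens(8)
print(" ".join(tokens))          # an arbitrary word sequence
print(max(probs) - min(probs))   # tiny gap: no token is meaningfully preferred
```

Because every token is almost equally likely, the sample carries no information about the prompt. Training is what makes the distribution sharp and context-dependent.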

Stage 1: Pre-training

Pre-training teaches the fundamentals by showing the model massive corpora and asking it to predict the next token over and over.

Along the way it picks up grammar, facts about the world, coding syntax, common phrases, and more—similar to reading a large slice of the internet and books.

The catch: a pre-trained model is mainly a text continuer. Prompt it with a question and it may ramble forward like an unfinished paragraph instead of answering cleanly. Conversation skills still lie ahead.

Next-token prediction on huge corpora builds raw language skill.
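The next-token objective can be sketched in a few lines. The example below is a simplification under stated assumptions: tokens are plain integers, and a count-based bigram table stands in for gradient-trained Transformer weights. The function names are illustrative, not from any library. What carries over to real pre-training is the loss itself: the average negative log-probability assigned to the true next token.

```python
import math
from collections import Counter, defaultdict

def next_token_pairs(token_ids):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]

def mean_next_token_loss(token_ids, predict):
    """Average cross-entropy: -log p(correct next token | context)."""
    pairs = next_token_pairs(token_ids)
    return sum(-math.log(predict(ctx)[tgt]) for ctx, tgt in pairs) / len(pairs)

def uniform_model(vocab_size):
    """Stand-in for an untrained model: every token is equally likely."""
    return lambda context: [1.0 / vocab_size] * vocab_size

def fit_bigram(token_ids, vocab_size, smoothing=1.0):
    """Stand-in for training: count which token follows which (add-one smoothed)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(token_ids, token_ids[1:]):
        counts[prev][nxt] += 1

    def predict(context):
        row = counts[context[-1]]  # only the last token is used as context
        total = sum(row.values()) + smoothing * vocab_size
        return [(row[t] + smoothing) / total for t in range(vocab_size)]

    return predict

corpus = [0, 1, 2, 3] * 3  # toy "text" with a repeating pattern
untrained_loss = mean_next_token_loss(corpus, uniform_model(4))
trained_loss = mean_next_token_loss(corpus, fit_bigram(corpus, 4))
print(untrained_loss, trained_loss)  # the fitted model scores lower loss
```

The uniform model's loss equals ln(vocabulary size), which is the starting point for a randomly initialized model. Any structure the model learns pushes the loss below that.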

Stage 2: Instruction fine-tuning

To make the model conversational, teams train on instruction–response pairs: a user-style prompt and a target reply (often written or reviewed by humans).

The model learns patterns like “when asked a question, answer directly” and “when asked to summarize, produce a summary”—not merely continue the prompt.

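This stage is where chatbots start to feel familiar. The architecture is unchanged and the same weights simply keep being updated; what changes is the training data, which is now curated prompt–response pairs rather than raw text.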

Prompt + desired response pairs teach helpful formats.
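A minimal sketch of how one instruction–response pair might be turned into a training example. The "User:/Assistant:" template, the whitespace tokenizer, and the function names are all assumptions made for illustration; real systems use their own chat templates and subword tokenizers. One common choice, shown here, is to compute the loss only on response tokens so the model is trained to produce the reply rather than to reproduce the prompt. The one-position shift between inputs and targets is omitted for brevity.

```python
class ToyTokenizer:
    """Whitespace tokenizer that assigns ids on first sight (illustrative only)."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]

def build_example(prompt, response, tokenize):
    """Concatenate prompt and response; mark which tokens count toward the loss."""
    prompt_ids = tokenize(f"User: {prompt}\nAssistant:")  # hypothetical template
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask

def masked_mean(per_token_losses, loss_mask):
    """Average the per-token losses over response positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

tokenize = ToyTokenizer()
input_ids, loss_mask = build_example(
    "What is an LLM?", "A large language model.", tokenize
)
print(input_ids)
print(loss_mask)  # zeros over the prompt, ones over the response
```

The per-token loss is still next-token cross-entropy, the same quantity used in pre-training. Only the data and the mask differ.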

What instruction tuning unlocks—and where it stops

After instruction fine-tuning, the same model can usually:

Answer questions in natural language.
Summarize long documents.
Write or explain code, and handle many other prompt-driven tasks.

By now the lab has likely burned through two expensive resources: most of the raw internet-scale pre-training data they planned to use, and the budget for high-quality human-written instruction pairs.

So how do you keep improving? The next stages lean on reinforcement learning (RL)—training with rewards instead of only copying static text.

Stage 3: Preference fine-tuning (PFT)

Some qualities are subjective: tone, politeness, verbosity, safety. There is no single ground-truth sentence—only what humans prefer.

You may have seen this in ChatGPT: “Which response do you prefer?” with two answers side by side. That is not just product feedback—it becomes training data.

OpenAI and others use these comparisons for preference fine-tuning. In short:

A user (or rater) chooses between two model responses.
Those choices build a dataset of human preferences.
A reward model learns to predict which answer humans would prefer.
The main LLM is updated with RL so its outputs score higher on that reward model.

Side-by-side responses feed preference data—not just star ratings.

Reward model + RL update loop (often called RLHF).
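The reward-model step is often trained with a pairwise (Bradley-Terry style) loss. The sketch below assumes the reward model has already produced a scalar score for each response; the scores in `comparisons` are made-up numbers for illustration, not real data.

```python
import math

def preference_probability(r_chosen, r_rejected):
    """Modelled probability that humans prefer the chosen response."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the human choice under that model."""
    return -math.log(preference_probability(r_chosen, r_rejected))

# Hypothetical (chosen_score, rejected_score) pairs from a reward model.
comparisons = [(1.5, 0.2), (0.1, 0.9), (2.0, -1.0)]
dataset_loss = sum(pairwise_loss(c, r) for c, r in comparisons) / len(comparisons)
print(dataset_loss)
```

The loss is small when the human-preferred response scores higher and large when the ranking is inverted (as in the second pair), so minimizing it teaches the reward model to rank answers the way raters did.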

RLHF and PPO—in plain language

The preference pipeline above is commonly called RLHF (Reinforcement Learning from Human Feedback). The LLM is the policy; the reward model scores its answers; an algorithm such as PPO (Proximal Policy Optimization) adjusts the weights so that high-scoring answers become more likely, while limiting how far each update can move the policy so that training stays stable.

Why bother? Because many goals—“be helpful,” “refuse unsafe requests,” “match brand voice”—are easier to express through comparisons than by writing perfect gold-standard replies for every situation.

RLHF teaches alignment when there is no single correct string, only better and worse options.
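PPO's stability comes from a clipped objective. The sketch below shows that per-sample objective under simplifying assumptions: log-probabilities and advantages are given as plain numbers, and the value function and the KL penalty that RLHF setups commonly add are left out. The default `clip_eps=0.2` is a conventional value, not something the pipeline above prescribes.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate (to be maximized).

    ratio > 1 means the updated policy makes this response more likely
    than the policy that generated it.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good answer (positive advantage) whose probability has already doubled:
capped = ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)
# The same answer before any update (ratio = 1):
baseline = ppo_clipped_objective(0.0, 0.0, advantage=1.0)
print(baseline, capped)
```

Once the ratio passes 1 + `clip_eps`, the objective stops growing, so a single batch cannot push the policy arbitrarily far toward one answer.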

Stage 4: Reasoning fine-tuning

Math, logic, and structured puzzles are different: there often is a right answer and a known path to get there.

Here human preferences matter less; correctness can be the reward. The usual loop:

The model generates an answer to a prompt (often with visible reasoning steps).
The answer is checked against a known correct result.
A reward is assigned from that check—full credit for right, less or none for wrong.
RL updates the model to favor chains that lead to correct outcomes.

This family is sometimes called reinforcement learning with verifiable rewards (RLVR). GRPO (Group Relative Policy Optimization, introduced by DeepSeek and since adopted by others) is a popular RL algorithm for training reasoning-heavy models.

Reasoning fine-tuning sits alongside preference tuning: one sharpens step-by-step work that can be checked for correctness; the other sharpens subjective quality.

Verifiable tasks: compare to ground truth, assign reward, update weights.
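The loop above can be sketched with an exact-match checker and a group-relative advantage in the style of GRPO. This is a simplified illustration, not DeepSeek's implementation: the sampled answers are invented, the checker is plain string comparison, and the use of the population standard deviation is an assumption (implementations differ on such details).

```python
import statistics

def verify(answer, ground_truth):
    """Verifiable reward: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample relative to the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical final answers from four sampled solutions to "2 + 2 = ?"
sampled_answers = ["4", "5", " 4 ", "3"]
rewards = [verify(a, "4") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)  # positive for correct samples, negative for incorrect ones
```

Samples that beat the group average get a positive advantage and are reinforced; the rest are discouraged. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.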

The full journey in one glance

Putting it all together—the path from scratch to a product-ready assistant:

Start from a randomly initialized model.
Pre-train on large-scale text to learn language and knowledge.
Use instruction fine-tuning so the model follows commands and chats usefully.
Use preference and reasoning fine-tuning to align tone and harden logic.

What this means if you are not training the next GPT

Most teams do not run this full pipeline. They download or call an API model that already went through these stages, then add RAG, prompts, guardrails, or a small fine-tune on their own data.

Knowing the stages still helps you debug (“is this a base-model knowledge gap or missing instruction tuning?”) and talk credibly with vendors about safety, RLHF, and reasoning models.

Key takeaways

Training is staged: pre-training for knowledge, then instruction tuning, then RL for preferences and reasoning.
Pre-trained models continue text; instruction tuning makes them answer like assistants.
RLHF (often with PPO) aligns subjective quality; RL with verifiable rewards (often with algorithms such as GRPO) strengthens math and logic.
Product builders usually start from a model that has already been through these stages, not from random weights.