After the architecture exists, training turns random weights into a helpful assistant—in clear stages from pre-training through alignment and reasoning.
Building the Transformer stack is only half the story. The model still needs to learn language and behave well in real products—answering questions, following instructions, staying safe, and reasoning when the task allows it.
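Labs rarely do all of that in one pass. Instead, starting from random weights, they chain stages, each with a different goal and dataset: pre-training, instruction fine-tuning, preference fine-tuning, and reasoning fine-tuning.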
Day zero is a model with random weights. It has never seen text, so it knows nothing about language.
Ask it “What is an LLM?” and you might get nonsense like “try peter hand” and “hello 448Sn”: word-shaped fragments with no meaning.
This stage matters as a mental anchor: every capability you see later was added by training, not wired in at birth.
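To make the day-zero behavior concrete, here is a minimal sketch of sampling from a model whose weights are random numbers. Everything in it is an illustrative assumption: the ten-token `VOCAB`, the single weight table standing in for a full Transformer, and the helper names `make_random_model` and `generate`.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["what", "is", "an", "llm", "try", "peter", "hand", "hello", "448Sn", "?"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_random_model(seed=0):
    """One row of random scores per previous token: a stand-in for untrained weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in VOCAB] for _ in VOCAB]

def generate(weights, prompt, steps, seed=1):
    """Sample `steps` tokens, each conditioned only on the previous token."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(steps):
        row = weights[VOCAB.index(out[-1])]
        probs = softmax(row)
        out.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return out

output = generate(make_random_model(), ["what", "is", "an", "llm", "?"], steps=5)
print(" ".join(output))
```

The sampled continuation has no relation to the question because nothing in the weight table has been shaped by text yet.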
Pre-training teaches the fundamentals by showing the model massive corpora and asking it to predict the next token over and over.
Along the way it picks up grammar, facts about the world, coding syntax, common phrases, and more—similar to reading a large slice of the internet and books.
The catch: a pre-trained model is mainly a text continuer. Prompt it with a question and it may ramble forward like an unfinished paragraph instead of answering cleanly. Conversation skills still lie ahead.
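The next-token objective can be illustrated with a deliberately tiny stand-in: a bigram counter instead of a neural network. This is a sketch under stated assumptions, not how production pre-training is implemented; the corpus, the whitespace tokenization, and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_loss(counts, tokens):
    """Mean negative log-likelihood of each actual next token (cross-entropy)."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(max(p, 1e-12)))  # guard against log(0)
    return sum(losses) / len(losses)

def continue_text(counts, start, steps):
    """Greedily append the most likely next token, like a text continuer."""
    out = [start]
    for _ in range(steps):
        probs = next_token_probs(counts, out[-1])
        if not probs:
            break
        out.append(max(probs, key=probs.get))
    return out

corpus = "the cat sat on the mat . the cat ate .".split()
counts = train_bigram(corpus)
print(next_token_probs(counts, "the"))
print(continue_text(counts, "the", steps=1))
print(average_loss(counts, corpus))
```

A real model replaces the count table with billions of learned weights, but the training signal is the same idea: assign high probability to the token that actually came next. Note that `continue_text` only ever extends the text, which mirrors the text-continuer behavior described above.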
To make the model conversational, teams train on instruction–response pairs: a user-style prompt and a target reply (often written or reviewed by humans).
The model learns patterns like “when asked a question, answer directly” and “when asked to summarize, produce a summary”—not merely continue the prompt.
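This stage is where chatbots start to feel familiar. The architecture is unchanged and training typically still predicts the next token; what changes is the data, which now consists of prompts paired with the replies the model should produce.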
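As a rough sketch of what one instruction-tuning example might look like, the code below joins a prompt and a target reply into a single sequence and marks which tokens count toward the loss. The `User:`/`Assistant:` template, the `<eos>` marker, the whitespace tokenization, and the choice to score only the reply tokens are assumptions for illustration; real chat templates and masking policies vary by lab and library.

```python
def build_example(prompt, response, template="User: {prompt}\nAssistant: "):
    """Join a prompt and target reply into one training sequence.

    Returns (tokens, loss_mask), where loss_mask[i] is 1 if token i
    belongs to the reply and 0 if it belongs to the prompt.
    """
    prompt_tokens = template.format(prompt=prompt).split()
    response_tokens = response.split() + ["<eos>"]  # hypothetical end marker
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_mean(per_token_losses, loss_mask):
    """Average the loss over reply tokens only, ignoring prompt tokens."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

tokens, loss_mask = build_example("What is an LLM?", "A large language model.")
for tok, keep in zip(tokens, loss_mask):
    print(f"{tok:12s} {'train' if keep else 'skip'}")
```

Scoring only the reply is one common way to teach "answer the prompt" rather than "continue the prompt": the model is never rewarded for reproducing the user's side of the exchange.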
After instruction fine-tuning, the same model can usually:
By now the lab has likely burned through two expensive resources: most of the raw internet-scale pre-training data it planned to use, and the budget for high-quality human-written instruction pairs.
So how do you keep improving? The next stages lean on reinforcement learning (RL)—training with rewards instead of only copying static text.
Some qualities are subjective: tone, politeness, verbosity, safety. There is no single ground-truth sentence—only what humans prefer.
You may have seen this in ChatGPT: “Which response do you prefer?” with two answers side by side. That is not just product feedback—it becomes training data.
OpenAI and others use these comparisons for preference fine-tuning. In short:
This preference pipeline is commonly called RLHF (Reinforcement Learning from Human Feedback). Human comparisons train a reward model; the LLM is the policy; the reward model scores its answers; and an algorithm such as PPO (Proximal Policy Optimization) adjusts the weights so good answers become more likely, while limiting each update so training stays stable.
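Two pieces of this pipeline reduce to short formulas, sketched below for single numbers rather than full models. The first is a pairwise (Bradley-Terry style) loss that is often used to train reward models from comparisons; the second is the clipped objective from PPO. The function names and the `clip_eps=0.2` default are illustrative assumptions, and the text does not say which exact formulation any particular lab uses.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the human-preferred answer scores higher.

    Equivalent to -log(sigmoid(reward_chosen - reward_rejected)).
    """
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO surrogate for one sample (to be maximized).

    `ratio` is new_policy_prob / old_policy_prob for the sampled answer;
    `advantage` says how much better than expected that answer scored.
    Clipping removes the incentive to move the ratio far from 1.
    """
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

print(preference_loss(2.0, 0.0))        # reward model agrees with the human
print(preference_loss(0.0, 2.0))        # reward model disagrees: larger loss
print(ppo_clipped_objective(1.5, 1.0))  # gain is capped at the clip boundary
```

The clipping is what keeps training stable: once the policy has already shifted about 20% toward a good answer in this sketch, pushing further earns no additional objective.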
Why bother? Because many goals—“be helpful,” “refuse unsafe requests,” “match brand voice”—are easier to express through comparisons than by writing perfect gold-standard replies for every situation.
RLHF teaches alignment when there is no single correct string, only better and worse options.
Math, logic, and structured puzzles are different: there often is a right answer and a known path to get there.
Here human preferences matter less; correctness can be the reward. The usual loop:
This family is sometimes called reinforcement learning with verifiable rewards (RLVR). GRPO (Group Relative Policy Optimization, used by DeepSeek and others) is a popular modern variant for reasoning-heavy models.
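A minimal sketch of the two ideas named above: a reward that comes from checking the answer, and a group-relative advantage in the spirit of GRPO. The exact-match checker, the sample answers, and the small `eps` constant are assumptions for illustration; real systems use more careful answer parsing and combine these advantages with a policy-gradient update that is not shown here.

```python
import statistics

def verifiable_reward(model_answer, reference_answer):
    """1.0 if the final answer matches the known correct one, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample relative to the other samples for the same prompt.

    Answers above the group mean get a positive advantage (made more likely);
    answers below it get a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four sampled answers to "What is 6 * 7?"
sampled = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(answer, "42") for answer in sampled]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because each answer is compared with its own group, no separate learned scorer is needed in this sketch. If every sample in a group is right (or every one is wrong), all advantages are zero and that prompt contributes no learning signal.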
Reasoning fine-tuning sits alongside preference tuning: one sharpens step-by-step work that can be checked for correctness; the other sharpens subjective quality.
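Putting it all together, the path from scratch to a product-ready assistant runs: random weights, then pre-training, then instruction fine-tuning, then preference fine-tuning (RLHF), then reasoning fine-tuning.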
Most teams do not run this full pipeline. They start from a model that has already been through these stages, either downloaded or called through an API, then add RAG, prompts, guardrails, or a small fine-tune on their own data.
Knowing the stages still helps you debug (“is this a base-model knowledge gap or missing instruction tuning?”) and talk credibly with vendors about safety, RLHF, and reasoning models.