Big “teacher” models can train smaller “student” models—useful when you need speed and cost savings without starting from zero.
Training a frontier model from scratch is out of reach for most teams. Knowledge distillation lets a smaller student learn from a larger teacher that already encodes strong patterns.
The goal is not to clone the teacher byte-for-byte—it is to transfer enough skill that the student runs faster, cheaper, or on-device while staying useful for your product.
You see this in the wild when large foundation models train smaller siblings (Llama family, Gemma, DeepSeek distillates).
In soft-label distillation, the teacher outputs a full probability distribution over the vocabulary for each token, not just its top pick. The student is trained to match that entire shape, often by minimizing the KL divergence between the two distributions.
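To make the matching step concrete, here is a minimal sketch of a per-token distillation loss in plain Python. The toy four-token vocabulary, the logit values, and the function names are all made up for illustration, and the `temperature` parameter is a common addition that the text above does not specify.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Distillation loss for one token position (temperature is an assumption)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)

# Hypothetical logits over a toy 4-token vocabulary.
teacher_logits = [4.0, 3.5, 0.5, -1.0]
close_student = [3.8, 3.4, 0.7, -0.9]   # similar shape to the teacher
far_student = [4.0, -1.0, 0.5, 3.5]     # same top pick, wrong runner-up

loss_close = soft_label_loss(teacher_logits, close_student)
loss_far = soft_label_loss(teacher_logits, far_student)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

Note that `far_student` agrees with the teacher on the top token yet is still penalized, because it ranks the alternatives differently. That is the extra information a full distribution carries.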
The signal is rich: the teacher’s “second choices” teach the student which alternatives were plausible. The downside is storage: keeping one probability per vocabulary entry for every token of an internet-scale corpus takes an enormous amount of space.
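A back-of-the-envelope calculation shows how quickly the storage grows. Every number below is a hypothetical assumption chosen for round arithmetic (a 128,000-entry vocabulary, one trillion tokens, 2 bytes per stored probability, 4 bytes per stored token ID), not a figure from any particular pipeline.

```python
# All constants are illustrative assumptions, not measurements.
VOCAB_SIZE = 128_000                 # entries in the teacher's vocabulary
NUM_TOKENS = 1_000_000_000_000       # tokens in the training corpus (1 trillion)
BYTES_PER_PROB = 2                   # e.g. one 16-bit float per probability
BYTES_PER_TOKEN_ID = 4               # e.g. one 32-bit integer per token ID

# Soft labels: one probability per vocabulary entry, for every token.
soft_bytes = NUM_TOKENS * VOCAB_SIZE * BYTES_PER_PROB
# Single-token labels: one token ID per token.
hard_bytes = NUM_TOKENS * BYTES_PER_TOKEN_ID

print(f"soft labels: {soft_bytes / 1e15:.0f} PB")      # soft labels: 256 PB
print(f"token IDs:   {hard_bytes / 1e12:.0f} TB")      # token IDs:   4 TB
print(f"ratio:       {soft_bytes // hard_bytes:,}x")   # ratio:       64,000x
```

Under these assumptions the full distributions need roughly 64,000 times more space than storing only the chosen token.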
In hard-label distillation, the teacher contributes only its chosen token as the label; the student still outputs a distribution but is supervised by that single target. This is much cheaper to store and ship, and DeepSeek-style pipelines popularized it for reasoning-focused students.
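The hard-label case can be sketched as an ordinary cross-entropy loss against the teacher's pick. The logits and function names below are hypothetical, and taking the teacher's greedy argmax is an assumption made for simplicity; a real pipeline might instead sample text from the teacher and fine-tune the student on it.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label(teacher_logits):
    """Index of the teacher's chosen token (greedy pick, an assumption here)."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])

def cross_entropy(student_logits, target_index):
    """Negative log-probability the student assigns to the target token."""
    return -math.log(softmax(student_logits)[target_index])

# Hypothetical logits over a toy 4-token vocabulary.
teacher_logits = [4.0, 3.5, 0.5, -1.0]
target = hard_label(teacher_logits)  # only this integer needs to be stored

agreeing_student = [5.0, 1.0, 0.0, -1.0]
disagreeing_student = [1.0, 5.0, 0.0, -1.0]

print("teacher's token:", target)
print(f"agreeing student loss:    {cross_entropy(agreeing_student, target):.4f}")
print(f"disagreeing student loss: {cross_entropy(disagreeing_student, target):.4f}")
```

The student never sees how close the teacher's runner-up was; it is only told which token won.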
Another variant, often called co-distillation, trains teacher and student together on the same batches. The student mimics the teacher’s probabilities while the teacher itself still learns from ground-truth labels. Early in training the teacher is weak too, so pipelines often mix the teacher signal with real data labels until the teacher matures.
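One way to picture the joint setup described above (often called co-distillation) is as two losses computed on the same example, with a weight that shifts the student from ground-truth labels toward the teacher over time. The linear ramp, the `warmup_steps` value, and all names below are hypothetical choices for this sketch; the text does not prescribe a particular schedule.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth label."""
    return -math.log(softmax(logits)[label])

def teacher_weight(step, warmup_steps):
    """Linear ramp from 0 to 1: lean on the teacher more as it matures (assumed schedule)."""
    return min(1.0, step / warmup_steps)

def co_distillation_losses(teacher_logits, student_logits, label, step, warmup_steps=1000):
    """Return (teacher_loss, student_loss) for one example at a given training step."""
    alpha = teacher_weight(step, warmup_steps)
    # The teacher learns only from the ground-truth label.
    teacher_loss = cross_entropy(teacher_logits, label)
    # The student blends the teacher signal with the ground-truth label.
    # In a real framework the teacher's probabilities would be treated as constants here.
    distill = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    student_loss = alpha * distill + (1.0 - alpha) * cross_entropy(student_logits, label)
    return teacher_loss, student_loss

# Hypothetical logits; the ground-truth label is token 0.
teacher_logits = [2.0, 1.5, 0.0, -0.5]
student_logits = [1.0, 1.2, 0.3, 0.0]

for step in (0, 500, 1000):
    t_loss, s_loss = co_distillation_losses(teacher_logits, student_logits, label=0, step=step)
    print(f"step {step:4d}: teacher loss {t_loss:.4f}, student loss {s_loss:.4f}")
```

At step 0 the student is trained purely on the real label, and by the end of the ramp it follows the teacher's distribution alone. Many other blends are possible, including keeping a fixed share of ground-truth supervision throughout.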