sharpbyte.dev

3 techniques to train an LLM using another LLM

Big “teacher” models can train smaller “student” models—useful when you need speed and cost savings without starting from zero.

Why distill at all?

Training a frontier model from scratch is out of reach for most teams. Knowledge distillation lets a smaller student learn from a larger teacher that already encodes strong patterns.

The goal is not to clone the teacher byte-for-byte—it is to transfer enough skill that the student runs faster, cheaper, or on-device while staying useful for your product.

You see this in the wild when large foundation models train smaller siblings (Llama family, Gemma, DeepSeek distillates).

A large teacher model supervises a smaller student during training.
Three common ways to transfer knowledge from teacher to student.

Soft-label distillation

The teacher outputs a full probability distribution over the vocabulary for each token—not just its top pick. The student is trained to match that entire shape (often with KL divergence).
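For a single token position, matching the teacher's distribution with KL divergence might look like the sketch below. It uses only the Python standard library and a made-up 4-token vocabulary. The function names and the `temperature` parameter are assumptions for illustration: temperature scaling is a common addition in distillation setups, not something the text above prescribes.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    temperature is an assumed, optional knob; 1.0 leaves the logits unchanged.
    """
    p = softmax(teacher_logits, temperature)  # soft labels from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 4-token vocabulary (values are made up).
teacher_logits = [4.0, 3.0, 0.5, -1.0]
close_student = [3.8, 3.1, 0.4, -0.9]  # roughly the same shape as the teacher
far_student = [0.0, 0.0, 4.0, 0.0]     # confident about a different token

loss_close = soft_label_loss(teacher_logits, close_student)
loss_far = soft_label_loss(teacher_logits, far_student)
print(f"close student: {loss_close:.4f}")
print(f"far student:   {loss_far:.4f}")
```

The loss is zero only when the student reproduces the teacher's distribution exactly, so the student is penalized for getting the runner-up tokens wrong as well as the top one.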

Rich signal: the teacher’s “second choices” teach the student what alternatives were plausible. Downside: soft labels hold one probability per vocabulary entry for every token, so storing them for internet-scale data takes an enormous amount of space.
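A back-of-the-envelope calculation gives a feel for that storage cost. Every number here is an assumption picked for round arithmetic (a one-trillion-token corpus, a 100,000-entry vocabulary, 2-byte probabilities, 4-byte token ids), not a figure from any particular pipeline.

```python
def soft_label_bytes(num_tokens, vocab_size, bytes_per_prob=2):
    """One stored probability per vocabulary entry, for every token."""
    return num_tokens * vocab_size * bytes_per_prob

def single_id_bytes(num_tokens, bytes_per_id=4):
    """One stored token id per token, for comparison."""
    return num_tokens * bytes_per_id

NUM_TOKENS = 10**12   # assumed corpus size: one trillion tokens
VOCAB_SIZE = 100_000  # assumed vocabulary size

soft = soft_label_bytes(NUM_TOKENS, VOCAB_SIZE)
single = single_id_bytes(NUM_TOKENS)

print(f"full distributions: {soft / 1e15:.0f} PB")    # full distributions: 200 PB
print(f"single token ids:   {single / 1e12:.0f} TB")  # single token ids:   4 TB
print(f"ratio: {soft // single:,}x")                  # ratio: 50,000x
```

Under these assumptions the full distributions take tens of thousands of times more space than one token id per position.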

Hard-label distillation

The teacher contributes only its chosen token as the label; the student still outputs a full distribution but is supervised by that single target, just as in ordinary next-token training. This is much cheaper to store and ship. DeepSeek-style pipelines popularized it for reasoning-focused students, fine-tuning smaller models on text generated by the larger one.
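With hard labels, the loss reduces to ordinary cross-entropy against the teacher's chosen token. The sketch below assumes the teacher picks its highest-scoring token (argmax); a token sampled from the teacher would be used the same way. Function names and logits are illustrative only.

```python
import math

def log_softmax(logits):
    """Log-probabilities, computed with the max trick for stability."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def hard_label(teacher_logits):
    """Index of the teacher's top token (assumes argmax decoding)."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])

def hard_label_loss(teacher_logits, student_logits):
    """Cross-entropy of the student against the teacher's single chosen token."""
    target = hard_label(teacher_logits)
    return -log_softmax(student_logits)[target]

# Toy logits over a 4-token vocabulary (values are made up).
teacher_logits = [4.0, 3.0, 0.5, -1.0]  # teacher's top token is index 0
agreeing_student = [5.0, 0.0, 0.0, 0.0]
disagreeing_student = [0.0, 5.0, 0.0, 0.0]

loss_agree = hard_label_loss(teacher_logits, agreeing_student)
loss_disagree = hard_label_loss(teacher_logits, disagreeing_student)
print(f"agreeing student:    {loss_agree:.4f}")
print(f"disagreeing student: {loss_disagree:.4f}")
```

The teacher rated index 1 nearly as high as index 0, but the disagreeing student gets no credit for choosing it: everything except the top token is discarded.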

Soft labels carry the full distribution; hard labels keep only the teacher’s top token.

Co-distillation

Teacher and student train together on the same batches. The student mimics the teacher’s probabilities while the teacher still learns from ground-truth labels. Early on the teacher is weak too, so pipelines often mix teacher signal with real data labels until the teacher matures.
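One way to picture the mixing is a weight that shifts the student's loss from ground-truth labels toward the teacher's probabilities as training proceeds. The linear ramp, the `warmup_steps` value, and all function names below are hypothetical; real pipelines may schedule the mix differently.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    """Loss against the ground-truth token."""
    return -math.log(softmax(logits)[true_index])

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def teacher_weight(step, warmup_steps):
    """Hypothetical schedule: trust in the teacher ramps linearly from 0 to 1."""
    return min(1.0, step / warmup_steps)

def co_distillation_losses(teacher_logits, student_logits, true_index,
                           step, warmup_steps=1000):
    """Per-token losses for one shared batch element."""
    # The teacher only ever learns from the ground-truth label.
    teacher_loss = cross_entropy(teacher_logits, true_index)
    # The student blends "mimic the teacher" with "fit the real label".
    # In a real framework the teacher's probabilities would be treated as
    # constants here, so the student's loss does not update the teacher.
    alpha = teacher_weight(step, warmup_steps)
    mimic = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    truth = cross_entropy(student_logits, true_index)
    student_loss = alpha * mimic + (1.0 - alpha) * truth
    return teacher_loss, student_loss

# Toy logits over a 4-token vocabulary (values are made up).
teacher_logits = [2.0, 1.0, 0.0, -1.0]
student_logits = [0.5, 0.5, 0.0, 0.0]
true_index = 0

for step in (0, 500, 1000):
    t_loss, s_loss = co_distillation_losses(
        teacher_logits, student_logits, true_index, step)
    print(f"step {step}: teacher={t_loss:.4f} student={s_loss:.4f}")
```

At step 0 the student ignores the still-weak teacher and learns only from the real label; once the ramp finishes, its loss is pure mimicry of the teacher.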

Key takeaways

  • Distillation gives up some of the teacher’s quality in exchange for a faster, cheaper student.
  • Soft labels teach more; hard labels store less.
  • Most app teams consume distilled models—they rarely run distillation themselves.