sharpbyte.dev

How do LLMs work?

At generation time the model assigns a probability to every token in its vocabulary. We then pick one token using a decoding rule (greedy or sampling) and settings such as temperature.

Conditional probability (without the fear)

Suppose you know the sky is cloudy. Rain feels more likely than a heat wave. That is conditional probability: how likely is event A given we already know B?

LLMs do the same with text: given all prior tokens (the context), what is the chance of each candidate next token?

During training the model repeatedly adjusts its internal weights so that the probabilities it assigns move closer to the patterns found in real text.

Example: after “I like to play…” the model might assign high probability to football, cricket, or tennis—but almost zero to banana, because context makes some continuations far more plausible than others.

Context narrows which next words are plausible—like weather given clouds.
The model outputs a score for every token in the vocabulary, then turns scores into probabilities.
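
To make the "scores into probabilities" step concrete, here is a minimal sketch of the softmax function, which is the usual way to do that conversion. The four-token vocabulary and the score values are made-up numbers for the "I like to play…" example, not output from a real model.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable; the result is unchanged.
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the context "I like to play" (illustrative only).
logits = {"football": 4.0, "cricket": 3.5, "tennis": 3.2, "banana": -2.0}
probs = softmax(logits)

# Print tokens from most to least likely.
for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:10s} {p:.3f}")
```

The sports share almost all of the probability while banana gets a tiny sliver, which mirrors the example above. A real model does the same thing over a vocabulary of tens of thousands of tokens.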

Greedy decoding vs. sampling

Always choosing the highest-probability token is called greedy decoding. It is fast and deterministic, but answers can feel repetitive—the model may loop or sound robotic.

Most chat apps instead sample: treat the probability distribution like a weighted lottery so occasional less-likely tokens still appear. That variety is what makes two runs with the same prompt differ slightly.

Greedy always picks the top token; sampling rolls weighted dice over the distribution.
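
The difference between the two strategies can be sketched in a few lines. This is an illustration under stated assumptions: the probabilities are invented, and `greedy_pick` and `sample_pick` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical next-token probabilities (made-up numbers for illustration).
probs = {"football": 0.48, "cricket": 0.29, "tennis": 0.22, "banana": 0.01}

def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the demo is repeatable
greedy_runs = [greedy_pick(probs) for _ in range(5)]
sampled_runs = [sample_pick(probs, rng) for _ in range(5)]

print("greedy: ", greedy_runs)   # the same token every time
print("sampled:", sampled_runs)  # typically a mix, favoring likelier tokens
```

Greedy returns football on every call. Sampling returns football most often, but cricket and tennis appear too, and banana can appear on rare occasions.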

Temperature and what you feel in the product

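Temperature rescales the logits before they are converted to probabilities: each logit is divided by the temperature T. Low temperature (near 0) concentrates probability on the top few tokens, so output becomes predictable and conservative and approaches greedy decoding, though it is not necessarily more accurate. Higher temperature spreads probability across more tokens, which reads as more creative but raises the risk of tangents or nonsense.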

When outputs feel “too safe” or “too wild,” temperature is often the first knob to tune—alongside the prompt itself.

Low temperature sharpens the distribution; high temperature flattens it.
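
The sharpening and flattening effect can be seen by dividing the logits by the temperature before applying softmax. This is a minimal sketch: the logits are invented, and `softmax_with_temperature` is a hypothetical helper name rather than a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # T = 0 cannot be divided by; in practice it is treated as greedy decoding.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores (illustrative only).
tokens = ["football", "cricket", "tennis", "banana"]
logits = [4.0, 3.5, 3.2, -2.0]

for temp in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    row = "  ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs))
    print(f"T={temp}: {row}")
```

At T=0.2 nearly all of the probability lands on football. At T=2.0 the three sports are much closer together and banana gains a small but noticeable share.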

Key takeaways

  • Generation is repeated “what token comes next?” given everything so far.
  • Greedy is deterministic; sampling adds variety and is standard in chat.
  • Temperature reshapes how adventurous each next-token choice is.