At generation time the model assigns probabilities to every possible next token—then we choose one using sampling rules and knobs like temperature.
Suppose you know the sky is cloudy. Rain feels more likely than a heat wave. That is conditional probability: how likely is event A given we already know B?
LLMs do the same with text: given all prior tokens (the context), what is the chance of each candidate next token?
During training the model adjusts its internal weights so that the probabilities it assigns come to match the patterns found in its training text.
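To make "probabilities that match patterns in text" concrete, here is a deliberately tiny sketch that estimates the chance of the next word given only the previous word by counting pairs in a made-up corpus. This is a simple counting model, not how an LLM is trained: a real model conditions on the whole context and learns its weights by gradient-based optimization. The corpus and the helper name `next_word_probs` are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = "i like to play football . i like to play tennis . i like to eat banana ."
tokens = corpus.split()

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from raw counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs_after_play = next_word_probs("play")
print(probs_after_play)
```

In this corpus "banana" never follows "play", so it gets no probability at all in that context, while it does appear after "eat". The same word can be likely or unlikely depending on what came before.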
Example: after “I like to play…” the model might assign high probability to football, cricket, or tennis—but almost zero to banana, because context makes some continuations far more plausible than others.
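As a rough sketch of where such numbers come from: a model typically produces a raw score (a logit) for each candidate token, and a softmax turns those scores into probabilities that sum to 1. The scores below are made up for illustration, and a real model scores every entry in its vocabulary, not four words.

```python
import math

# Hypothetical logits for candidate next tokens after "I like to play".
# These values are invented; they are not taken from any real model.
logits = {"football": 4.0, "cricket": 3.5, "tennis": 3.2, "banana": -3.0}

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:10s} {p:.4f}")
```

The three sports end up sharing almost all of the probability, while "banana" receives a tiny but nonzero share.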
Always choosing the highest-probability token is called greedy decoding. It is fast and deterministic, but answers can feel repetitive—the model may loop or sound robotic.
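Greedy decoding can be sketched in a few lines. The probabilities here are hypothetical, and `greedy_pick` is an illustrative helper name, not part of any library.

```python
# Hypothetical next-token probabilities; values invented for illustration.
probs = {"football": 0.48, "cricket": 0.29, "tennis": 0.22, "banana": 0.01}

def greedy_pick(distribution):
    """Return the single most probable token."""
    return max(distribution, key=distribution.get)

# Calling it repeatedly on the same distribution always gives the same token.
picks = [greedy_pick(probs) for _ in range(5)]
print(picks)
```

Because nothing random is involved, the same context always yields the same choice, which is where the determinism comes from.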
Most chat apps instead sample: they treat the probability distribution like a weighted lottery, so less likely tokens still appear occasionally. That randomness is why two runs with the same prompt can produce different answers.
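The weighted lottery can be sketched with Python's standard `random.choices`, which draws items in proportion to the weights it is given. The probabilities are again hypothetical, and the fixed seed is there only so the sketch is repeatable; a chat app would not normally fix it.

```python
import random
from collections import Counter

# Hypothetical next-token probabilities; values invented for illustration.
probs = {"football": 0.48, "cricket": 0.29, "tennis": 0.22, "banana": 0.01}

def sample_token(distribution, rng):
    """Draw one token, with chance proportional to its probability."""
    tokens = list(distribution)
    weights = list(distribution.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make this demo reproducible
draws = Counter(sample_token(probs, rng) for _ in range(1000))
print(draws.most_common())
```

Over many draws the counts roughly track the probabilities: "football" wins most often, but "cricket" and "tennis" appear regularly, and a rare token can occasionally slip through.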
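Temperature divides the logits before they are converted to probabilities. A low temperature (near 0) concentrates probability mass on the top few tokens, which gives predictable, consistent output, though not necessarily more accurate output. A temperature of 1 leaves the model's distribution unchanged. Higher temperatures spread probability more evenly, which gives more varied, creative output but also more risk of tangents or nonsense.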
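A minimal sketch of temperature, assuming the common formulation in which each logit is divided by the temperature before a softmax. The logits are invented, and `softmax_with_temperature` is an illustrative name rather than a library function.

```python
import math

# Hypothetical logits; values invented for illustration.
logits = {"football": 4.0, "cricket": 3.5, "tennis": 3.2, "banana": -3.0}

def softmax_with_temperature(scores, temperature):
    """Divide each score by the temperature, then apply a softmax."""
    if temperature <= 0:
        # Division by zero is undefined; a temperature of 0 is usually
        # handled as plain greedy decoding instead.
        raise ValueError("temperature must be positive")
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    row = "  ".join(f"{tok}={p:.3f}" for tok, p in probs.items())
    print(f"T={t}: {row}")
```

At the low setting nearly all of the probability lands on the top token. At the high setting the gap between the sports shrinks and even "banana" gains a visible share.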
When outputs feel “too safe” or “too wild,” temperature is often the first knob to tune—alongside the prompt itself.