4 LLM text generation strategies

Predicting probabilities is only half the story—you still need a strategy to pick each next token.

Why strategy matters

The model always outputs a probability distribution. Your decoding strategy decides which token actually gets appended. Same weights, same prompt—different strategies can produce flat, creative, or globally coherent text.

Different strategies trade speed, creativity, and global coherence.

Greedy decoding

Pick the highest-probability token at every step. Fast and deterministic, but prone to repetition and local optima—the model never explores slightly lower-probability words that might lead to a better sentence overall.
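As a minimal sketch of the loop described above: the snippet below uses a hypothetical lookup table (`TOY_MODEL`) in place of a real model, and the helper names `next_token_probs` and `greedy_decode` are made up for illustration. The toy model looks only at the last token, which a real LLM would not do.

```python
from typing import Dict, List

# Hypothetical toy "model": the next-token distribution depends only on the
# last token. A real LLM conditions on the whole context.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"ran": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}


def next_token_probs(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a forward pass: return P(next token | context)."""
    return TOY_MODEL[tokens[-1]]


def greedy_decode(prompt: List[str], max_new_tokens: int = 10,
                  eos: str = "</s>") -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # argmax: no randomness involved
        tokens.append(best)
        if best == eos:
            break
    return tokens


print(greedy_decode(["<s>"]))
# ['<s>', 'the', 'cat', 'sat', '</s>']
```

Running it twice gives the same output, which is the determinism described above; it is also why a greedy decoder cannot recover once an early high-probability token leads somewhere poor.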

Multinomial sampling

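Draw the next token at random in proportion to its probability, usually after top_k or top_p filtering has trimmed the distribution to the most likely candidates. Temperature controls how flat the lottery is: values below 1 sharpen the distribution toward the top tokens, values above 1 flatten it. This is what most chat UIs use so replies feel natural rather than identical every time.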
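The following sketch shows one way temperature, top_k, and top_p can be combined for a single sampling step. The function name `sample_next_token` and the example logits are assumptions for illustration; real libraries differ in details such as filter order and tie handling.

```python
import math
import random
from typing import Dict, Optional


def sample_next_token(logits: Dict[str, float],
                      temperature: float = 1.0,
                      top_k: Optional[int] = None,
                      top_p: Optional[float] = None,
                      rng: Optional[random.Random] = None) -> str:
    """Sample one token; temperature must be > 0."""
    if rng is None:
        rng = random.Random()

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability >= top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices treats weights as relative, so no renormalization needed.
    return rng.choices(tokens, weights=weights, k=1)[0]


example_logits = {"yes": 2.0, "no": 1.0, "maybe": 0.0}
print(sample_next_token(example_logits, temperature=0.7, top_p=0.9,
                        rng=random.Random(0)))
```

Setting `top_k=1` reduces this to greedy decoding, which is a handy sanity check when tuning the other parameters.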

Beam search

Maintain several partial hypotheses (beams) in parallel. At each step, extend every beam with its candidate next tokens, score each extended path by cumulative log-probability, and keep only the top few. Slower than greedy decoding, but it can improve global coherence; common in machine translation, less common in open-ended chat.

Multiple candidate sentences compete; the highest-scoring full path wins.
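A small sketch of the beam procedure, again over a hypothetical lookup-table model (`TOY_MODEL` and the function names are illustrative, not from any library). The table is chosen so that the greedy first step ("a") leads to a less likely sentence than the runner-up ("the"). Scores are summed log-probabilities with no length normalization, which is a simplification.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy model keyed on the last token only.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def next_token_probs(tokens: List[str]) -> Dict[str, float]:
    return TOY_MODEL[tokens[-1]]


def beam_search(prompt: List[str], beam_width: int = 2,
                max_new_tokens: int = 10,
                eos: str = "</s>") -> Tuple[float, List[str]]:
    # Each beam is (cumulative log-probability, tokens so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, list(prompt))]
    for _ in range(max_new_tokens):
        candidates: List[Tuple[float, List[str]]] = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the best-scoring paths.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams[0]


score, tokens = beam_search(["<s>"], beam_width=2)
print(tokens, round(math.exp(score), 2))
# ['<s>', 'the', 'cat', '</s>'] 0.36
```

With `beam_width=1` the same function behaves like greedy decoding and returns the "a cat" path (probability 0.24), so the wider beam finds the higher-probability sentence at the cost of scoring more candidates per step.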

Contrastive search

Choose among the top few candidates by balancing model probability against a penalty for candidates whose representations are too similar to tokens already in the context. Helps long-form generation stay diverse without drifting into nonsense the way very high temperature can.
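To make the penalty concrete, here is a sketch of a single selection step under one common formulation, which is an assumption here since the text above does not spell out a formula: among the top-k candidates, pick the one maximizing `(1 - alpha) * probability - alpha * max_cosine_similarity_to_context`. The two-dimensional vectors are made-up stand-ins for hidden states, and in a real model the candidate's representation would come from a forward pass rather than a fixed table.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(probs: Dict[str, float],
                     candidate_vecs: Dict[str, Sequence[float]],
                     context_vecs: List[Sequence[float]],
                     k: int = 3, alpha: float = 0.6) -> str:
    """Pick one token from the top-k by probability minus similarity penalty."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]

    def score(tok: str) -> float:
        # Penalty: similarity to the most similar token already generated.
        penalty = max((cosine(candidate_vecs[tok], h) for h in context_vecs),
                      default=0.0)
        return (1 - alpha) * probs[tok] - alpha * penalty

    return max(top, key=score)


# Made-up example: "again" is most probable but nearly repeats the context.
probs = {"again": 0.5, "then": 0.3, "zebra": 0.2}
candidate_vecs = {"again": [1.0, 0.0], "then": [0.0, 1.0], "zebra": [-1.0, 0.0]}
context_vecs = [[1.0, 0.0], [0.9, 0.1]]

print(contrastive_pick(probs, candidate_vecs, context_vecs, k=2, alpha=0.6))
# then
print(contrastive_pick(probs, candidate_vecs, context_vecs, k=2, alpha=0.0))
# again
```

With `alpha=0.0` the penalty vanishes and the step reduces to greedy decoding; restricting to the top-k keeps unlikely tokens such as "zebra" out of consideration even though they are dissimilar to the context.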

Bonus: SLED decoding

Most decoding uses only the final layer’s scores. Earlier layers sometimes hold factual signal that gets washed out. SLED (Self Logits Evolution Decoding) compares the final layer's logits with those from earlier layers and adjusts the final logits accordingly before picking the next token, which is reported to improve factuality without retraining.

Combine signals from early, middle, and final layers.

Key takeaways

No single strategy wins every product—match decoding to the use case.
Chat apps usually sample; batch translation may use beam search.
Advanced decoders (contrastive, SLED) target repetition and hallucination.