Predicting probabilities is only half the story—you still need a strategy to pick each next token.
The model always outputs a probability distribution. Your decoding strategy decides which token actually gets appended. Same weights, same prompt—different strategies can produce flat, creative, or globally coherent text.
Greedy decoding picks the highest-probability token at every step. It is fast and deterministic, but prone to repetition and local optima: the model never explores slightly lower-probability words that might lead to a better sentence overall.
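As a rough illustration of greedy picking, here is a minimal sketch in Python. The names `greedy_decode`, `toy_model`, and `TOY_TABLE` are hypothetical, and the "model" is just a lookup table over a three-token vocabulary, not a real language model.

```python
# Hypothetical toy "model": the next-token distribution depends only on the last token.
TOY_TABLE = {
    0: [0.1, 0.6, 0.3],
    1: [0.2, 0.1, 0.7],
    2: [0.5, 0.3, 0.2],
}


def toy_model(tokens):
    """Return a probability distribution over the 3-token vocabulary."""
    return TOY_TABLE[tokens[-1]]


def greedy_pick(probs):
    """Return the index of the highest-probability token (first one on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def greedy_decode(next_token_probs, prompt, max_new_tokens):
    """Append the single most likely token at every step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(greedy_pick(next_token_probs(tokens)))
    return tokens


print(greedy_decode(toy_model, [0], 4))  # [0, 1, 2, 0, 1]
```

Even in this tiny example the output falls into a 0, 1, 2 cycle, which is the kind of repetition greedy picking tends toward.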
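Sampling draws the next token at random from the model's distribution, often after top_k or top_p filtering has trimmed the unlikely tail. Temperature controls how flat the lottery is: values above 1 flatten the distribution, and values below 1 sharpen it toward the most likely tokens. This is what most chat UIs use so replies feel natural rather than identical every time.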
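The interaction of temperature, top_k, and top_p can be sketched as follows. This is a simplified illustration using only the standard library; the function names are hypothetical, and real inference libraries may order or implement the filters differently.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p; renormalize what is left."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token id at random from the filtered, renormalized distribution."""
    probs = softmax_with_temperature(logits, temperature)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: mass concentrates on token 0
print(softmax_with_temperature(logits, 2.0))  # flatter: mass spreads out
print(sample_token(logits, temperature=0.8, top_p=0.9))
```

Setting `top_k=1` makes the draw deterministic, which is one way to see that greedy picking is a special case of filtered sampling.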
Beam search maintains several partial hypotheses in parallel (beams). At each step, it extends each beam and keeps only the best-scoring paths. It is slower but can improve global coherence, which makes it common in machine translation and less common in open-ended chat.
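A bare-bones beam search might look like the sketch below. It assumes a hypothetical toy model (`toy_model`, `TOY_TABLE`) and scores paths by summed log-probability. Practical implementations usually add end-of-sequence handling and length normalization, both omitted here.

```python
import math

# Hypothetical toy model: token 1 looks best at first, but token 2 leads to a
# much more confident continuation.
TOY_TABLE = {
    0: [0.0, 0.6, 0.4],
    1: [0.34, 0.33, 0.33],
    2: [0.9, 0.05, 0.05],
}


def toy_model(tokens):
    """Return next-token probabilities based only on the last token."""
    return TOY_TABLE[tokens[-1]]


def beam_search(next_token_probs, prompt, beam_width, max_new_tokens):
    """Return the surviving beams as (log_prob, tokens) pairs, best first."""
    beams = [(0.0, list(prompt))]
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            for tok, p in enumerate(next_token_probs(tokens)):
                if p > 0.0:  # skip impossible tokens (log(0) is undefined)
                    candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the best-scoring partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


print(beam_search(toy_model, [0], beam_width=1, max_new_tokens=2)[0][1])  # [0, 1, 0]
print(beam_search(toy_model, [0], beam_width=2, max_new_tokens=2)[0][1])  # [0, 2, 0]
```

With a beam width of 1 the search reduces to greedy picking and commits to token 1 (path probability about 0.204). With a width of 2 it keeps the initially weaker token 2 alive and finds the higher-probability path (0.36).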
A similarity penalty lowers the score of candidates that are too similar to tokens already in the output, as in contrastive search. This helps long-form generation stay diverse while drifting into nonsense less readily than sampling at very high temperature does.
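One way to picture a similarity penalty is the sketch below, loosely modeled on contrastive-search-style scoring: each top candidate's probability is traded off against its maximum cosine similarity to tokens already generated. Everything here is an assumption for illustration, including the function names, the `alpha` weight, and the use of fixed per-token embedding vectors in place of a model's contextual hidden states.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def penalized_pick(probs, embeddings, history, alpha=0.6, top_k=3):
    """Pick among the top_k candidates, trading probability against
    similarity to tokens already in the output (history)."""
    candidates = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    def score(tok):
        # Penalty: highest similarity to any previous token (0 if no history).
        penalty = max(
            (cosine(embeddings[tok], embeddings[h]) for h in history),
            default=0.0,
        )
        return (1 - alpha) * probs[tok] - alpha * penalty

    return max(candidates, key=score)


# Hypothetical 2-d embeddings: tokens 0 and 1 are near-duplicates, token 2 is different.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
probs = [0.5, 0.3, 0.2]

print(penalized_pick(probs, embeddings, history=[0]))             # 2
print(penalized_pick(probs, embeddings, history=[0], alpha=0.0))  # 0
```

With `alpha=0.0` the penalty vanishes and the pick is simply the most likely token. Raising `alpha` pushes the choice away from anything resembling what has already been written.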
Most decoding strategies use only the final layer's scores. Earlier layers sometimes hold factual signal that gets washed out. SLED blends logits across layers toward agreement before picking the next token, which often improves factuality without retraining.