7 LLM generation parameters

APIs expose levers that shape the same next-token choice—learn what each dial does before you blame the model.

Your inference control panel

Every completion is still “pick the next token, repeat.” These API parameters change how that pick happens. They interact—changing temperature alone can undo careful prompt design—so log them when debugging production issues.

Common parameters you will see in OpenAI-compatible APIs.

max_tokens

A hard cap on how many tokens the model may generate in one response. Too low and answers truncate mid-thought; too high and you pay for rambling output you did not need.
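
To make the cap concrete, here is a minimal sketch of a generation loop, not any provider's actual implementation. The `generate` function and the `next_token` callback are hypothetical names used for illustration; the "stop" and "length" labels mirror the finish reasons that OpenAI-style APIs report.

```python
def generate(next_token, max_tokens, eos="<eos>"):
    """Toy decoding loop: call next_token until EOS or the token cap is hit.

    next_token is a hypothetical callback that receives the tokens so far
    and returns the next one. Returns (tokens, finish_reason).
    """
    tokens = []
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok == eos:
            return tokens, "stop"    # the model finished on its own
        tokens.append(tok)
    return tokens, "length"          # cut off by max_tokens


# A scripted "model" that always wants to say the same five tokens.
script = ["The", "answer", "is", "42", "<eos>"]
scripted_model = lambda so_far: script[len(so_far)]

print(generate(scripted_model, max_tokens=3))
print(generate(scripted_model, max_tokens=10))
```

Checking the finish reason in production is a cheap way to tell a truncated answer from a complete one.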

temperature

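Controls randomness by dividing the logits by the temperature before softmax: values below 1 sharpen the distribution toward the top token, and values above 1 flatten it. Use lower values (0–0.3) for factual Q&A, support bots, and structured extraction; higher values (0.7–1.0) for brainstorming or creative writing.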
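
The scaling step can be sketched in a few lines of plain Python. This is an illustrative sketch, assuming three made-up logits; `softmax_with_temperature` is a hypothetical helper, and real runtimes do the same arithmetic over the whole vocabulary on the GPU.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        # Temperature 0 is usually handled as greedy argmax, not as a division.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability mass lands on the top token; at higher temperature the three candidates move closer together.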

top_k and top_p (nucleus)

top_k keeps only the K highest-probability tokens before sampling—simple but blunt.

top_p keeps the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9). When the model is confident, the pool is tiny; when uncertain, more tokens stay eligible.
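
Both filters can be sketched over a small probability list. The helpers `top_k_filter` and `top_p_filter` and the two example distributions are assumptions made for illustration; inference libraries apply the same idea to logits tensors.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


confident = [0.92, 0.05, 0.02, 0.01]  # made-up: one clear winner
uncertain = [0.30, 0.28, 0.22, 0.20]  # made-up: no clear winner

print(len(top_p_filter(confident, 0.9)), len(top_p_filter(uncertain, 0.9)))
print(len(top_k_filter(confident, 2)), len(top_k_filter(uncertain, 2)))
```

With p = 0.9 the confident distribution keeps a single token while the uncertain one keeps all four; top_k with k = 2 keeps two tokens in both cases, which is the bluntness mentioned above.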

frequency and presence penalties

frequency_penalty down-weights each token in proportion to how often it has already appeared, which helps stop loops like “the the the.”

presence_penalty penalizes any token that appeared at least once, nudging the model toward new vocabulary and topics.
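
A minimal sketch of how the two penalties could adjust logits, following the subtraction formula OpenAI has documented (count times frequency_penalty, plus presence_penalty once for any token already seen). Other runtimes may implement penalties differently; `apply_penalties` and the token ids are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted.

    logits is indexed by token id; generated_ids lists the ids emitted so far.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted


logits = [1.0, 1.0, 1.0]   # made-up scores for token ids 0, 1, 2
history = [0, 0, 0, 1]     # token 0 appeared three times, token 1 once
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 is pushed down the most because it has repeated, token 1 takes a smaller hit, and token 2 is untouched because it has not appeared yet.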

stop sequences

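Strings that end generation as soon as the model emits one of them. Many APIs, OpenAI's included, leave the matched stop string out of the returned text. Useful for JSON blocks, numbered lists, or multi-step agents where the model must stop so that your code can run the next tool call.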
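
The effect can be imitated on a finished string. This is a client-side sketch, not how a server implements it (servers check during decoding and stop spending tokens); `truncate_at_stop` is a hypothetical helper, and it drops the stop string itself, which is an assumption that matches OpenAI's documented behavior but not necessarily every runtime.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text up to, but not including, the earliest stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Made-up agent transcript: the model should stop before inventing a tool result.
raw = 'Action: search("weather in Oslo")\nObservation: it is sunny'
print(truncate_at_stop(raw, ["\nObservation:"]))
```

Here the made-up "Observation" line is cut away, so the calling code can run the search itself and feed the real result back in.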

Bonus: min-p sampling

Some runtimes offer min_p: keep tokens whose probability is at least a fraction of the top token’s probability. When the model is confident, the pool shrinks; when it is uncertain, more tokens stay in play. It is a newer alternative to fixing top_p at 0.9 for every prompt.
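
The rule described above fits in a few lines. This is an illustrative sketch with made-up distributions; `min_p_filter` is a hypothetical helper, and runtimes that support min_p may differ in details such as the order in which temperature and filtering are applied.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    keep = [i for i, p in enumerate(probs) if p >= threshold]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


confident = [0.92, 0.05, 0.02, 0.01]  # threshold at min_p=0.1 is 0.092
uncertain = [0.30, 0.28, 0.22, 0.20]  # threshold at min_p=0.1 is 0.03

print(len(min_p_filter(confident, 0.1)), len(min_p_filter(uncertain, 0.1)))
```

Because the threshold is tied to the top token, the same min_p setting keeps one candidate in the confident case and all four in the uncertain one.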

Key takeaways

Treat parameters as a set—tune them together on real prompts.
Structured outputs: low temperature + explicit stop sequences.
Log settings alongside prompts when debugging production issues.
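
One way to follow the last takeaway is to write the parameters into the same log record as the prompt. The sketch below uses only the standard library; `build_log_record` and the field names are hypothetical, so adapt them to whatever logging schema you already use.

```python
import json


def build_log_record(prompt, params, response_text):
    """Serialize a prompt, its generation parameters, and the response as one JSON line."""
    record = {
        "prompt": prompt,
        "params": params,          # the exact settings sent with the request
        "response": response_text,
    }
    return json.dumps(record, sort_keys=True)


params = {"temperature": 0.2, "top_p": 0.9, "max_tokens": 256, "stop": ["\n\n"]}
line = build_log_record("Extract the invoice total.", params, '{"total": 41.5}')
print(line)
```

Keeping the settings next to the prompt means a surprising output can be replayed with the same parameters instead of guessed at.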