APIs expose dials that all shape the same next-token choice. Learn what each one does before you blame the model.
Every completion is still “pick the next token, repeat.” These API parameters change how that pick happens. They interact—changing temperature alone can undo careful prompt design—so log them when debugging production issues.
max_tokens (the exact name varies by API) sets a hard cap on how many tokens the model may generate in one response. Too low and answers truncate mid-thought; too high and you pay for rambling output you did not need.
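temperature controls randomness by dividing the logits by the temperature value before softmax: values below 1 sharpen the distribution toward the top token, and values above 1 flatten it. Use lower values (0–0.3) for factual Q&A, support bots, and structured extraction; higher values (0.7–1.0) for brainstorming or creative writing.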
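As a rough illustration of the logit scaling described above, here is a minimal pure-Python sketch. The function name softmax_with_temperature and the three example logits are made up for illustration; real inference stacks do this on tensors, and many treat a temperature of 0 as greedy argmax rather than dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by temperature."""
    if temperature <= 0:
        # Assumption: callers handle temperature == 0 as greedy argmax elsewhere.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # sharply favors the top token
hot = softmax_with_temperature(logits, 1.0)   # leaves more mass on the others
```

Comparing cold[0] with hot[0] shows why low temperatures suit extraction tasks: almost all probability collapses onto the top-scoring token.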
top_k keeps only the K highest-probability tokens before sampling. It is simple but blunt, because K stays fixed whether the model is confident or unsure.
top_p keeps the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9). When the model is confident, the pool is tiny; when uncertain, more tokens stay eligible.
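The two truncation rules above can be sketched side by side. This is an illustrative version over a plain list of probabilities, not any particular library's implementation; the helper names and the two example distributions are assumptions chosen to show how the top_p pool grows and shrinks while the top_k pool does not.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical next-token distributions.
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]

pool_confident = top_p_filter(confident, 0.9)  # one token is enough to reach 0.9
pool_uncertain = top_p_filter(uncertain, 0.9)  # all four tokens are needed
fixed_pool = top_k_filter(uncertain, 2)        # always exactly two tokens
```

With p = 0.9, the confident distribution keeps a single candidate while the uncertain one keeps all four; top_k with K = 2 returns two candidates in both cases.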
frequency_penalty down-weights tokens in proportion to how often they have already appeared, which helps stop loops like “the the the.”
presence_penalty penalizes any token that appeared at least once, nudging the model toward new vocabulary and topics.
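One common way to apply both penalties is to subtract them from the logits before sampling: the frequency term scales with the repeat count, and the presence term is a flat deduction for any token seen at least once. The sketch below assumes that formulation; apply_penalties, the token ids, and the penalty values are hypothetical, and individual providers may compute the adjustment differently.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with every repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once per seen token
    return adjusted

# Hypothetical vocabulary of three tokens, all equally likely to start with.
logits = [1.0, 1.0, 1.0]
history = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2)
```

Token 0 is pushed down hardest because it repeated three times, token 1 takes a smaller hit, and the unseen token 2 is left untouched.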
Stop sequences are strings that halt generation as soon as the model emits them. They are essential for JSON blocks, numbered lists, or multi-step agents that need the model to stop before the next tool call.
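APIs normally enforce stop strings on the server while tokens are being generated. The client-side sketch below only emulates the effect on finished text, which can help when testing an agent loop offline. The function truncate_at_stop, the agent transcript, and the choice to drop the stop string from the result are assumptions for illustration.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text up to the earliest stop sequence, excluding the stop itself."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# Hypothetical agent output that runs past the point where a tool should be called.
raw = 'Thought: look up the weather.\nAction: get_weather("Paris")\nObservation: sunny'
trimmed = truncate_at_stop(raw, ["\nObservation:"])
```

Here the model's invented observation is cut off, leaving only the thought and the action for the caller to execute.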
Some runtimes offer min_p: keep tokens whose probability is at least a fraction of the top token’s probability. When the model is confident, the pool shrinks; when it is uncertain, more tokens stay in play. It is a newer alternative to fixing top_p at 0.9 for every prompt.
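The min_p rule described above fits in a few lines. This sketch works on a plain list of probabilities; min_p_filter and the example distributions are illustrative assumptions, not the code of any specific runtime.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    keep = [i for i, prob in enumerate(probs) if prob >= threshold]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical next-token distributions.
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]

narrow = min_p_filter(confident, 0.1)  # threshold 0.092 leaves only the top token
wide = min_p_filter(uncertain, 0.1)    # threshold 0.03 keeps all four tokens
```

Because the threshold moves with the top token's probability, the same min_p value yields a narrow pool for the confident case and a wide one for the uncertain case.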