Almost every chat API exposes a slider called temperature. The default is usually 1.0, the floor is 0.0, the ceiling is 2.0, and the documentation says something vague about creativity. Most people drag it around and watch what happens. Almost nobody explains what the number is actually doing, which is unfortunate, because it is doing exactly one thing, and the thing is small enough to fit on a postcard.
Here is the postcard. When the model finishes a forward pass, it emits a vector of raw scores called logits, one per token in the vocabulary. Logits are not probabilities. They can be negative, they can be huge, and they do not sum to anything in particular. To turn them into probabilities you run them through softmax, which exponentiates each one and normalises by the total. That is the default. Temperature inserts itself one step earlier. It divides every logit by T before the exponential. So the formula becomes P(x_i) = exp(l_i / T) over the sum of exp(l_j / T) for the whole vocabulary. That is the whole intervention. One scalar, applied uniformly, before softmax.
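For the notation-averse, here is the same postcard as a few lines of Python. This is a minimal sketch, not any provider's actual implementation, and the toy logits are invented for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Turn raw logits into probabilities, dividing by T before the exponential."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()   # shift by the max for numerical stability; the probabilities are unchanged
    exps = np.exp(scaled)
    return exps / exps.sum()

# Four made-up logits standing in for a whole vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits))   # T=1: the plain softmax, roughly [0.61, 0.22, 0.14, 0.03]
```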
What this does to the distribution is the only thing worth understanding. Dividing by a small T (say 0.2) makes the gaps between logits five times bigger. After softmax, the already-high-scoring token absorbs almost all of the probability mass and everything else drops to a rounding error. The model becomes boring and consistent. Dividing by a large T (say 1.5) does the opposite: it squashes the gaps, the exponential can no longer amplify the leader, and the unlikely tokens get a real chance. The model becomes noisier and less self-consistent. T=1 is the identity, the original distribution, no scaling at all. T=0 is a special case (the formula divides by zero), so the major APIs quietly swap in greedy decoding instead: always take the top-ranked token.
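You can watch the reshaping directly by reusing the helper above on the same toy logits (the printed values are rounded, and of course specific to these made-up numbers):

```python
for T in (0.2, 1.0, 1.5):
    print(f"T={T}:", np.round(softmax_with_temperature(logits, T), 3))
# T=0.2: [0.993 0.007 0.001 0.   ]  -> the leader absorbs almost everything
# T=1.0: [0.609 0.224 0.136 0.03 ]  -> the untouched distribution
# T=1.5: [0.496 0.255 0.182 0.067]  -> the tail gets a real chance

# The T=0 special case is not a division at all, just greedy decoding:
best_token = int(np.argmax(logits))
```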
There is a tidy worked example in a Machine Learning Mastery walk-through where the prompt is "Today's weather is so" and the top candidate is "nice". At T=0.1, T=0.5, and T=1.0 the model picks "nice" every time. At T=3.0 it drifts to "wonderful". At T=10 it lands on "delicious" and the sentence stops meaning anything. The model is not getting more creative in any meaningful sense. It is getting noisier. Some of the noise looks like creativity because human readers reach for the closest interpretation, the same way we do when staring at a Rorschach blot.
This is also where the relationship to hallucination lives. Higher temperature does not invent facts the model didn't know. It promotes lower-probability continuations the model had already considered and ranked low. Sometimes the second-best guess is genuinely useful, and sometimes it launches a confident, fluent, completely wrong sentence. The underlying problem (no internal verification step) is the same either way; temperature just changes how often the model gets to roll for it.
The practical upshot is duller than the slider's mystique. For factual extraction, code generation, structured output: keep T low or zero, and rely on top-k or top-p to manage the long tail if you need diversity at all. For brainstorming, fiction, playful prose: edge it up to 0.8 or 1.0, but expect to throw away more output. Above 1.5 the model is mostly rolling dice weighted by a distribution it no longer respects, and the returns are sharply diminishing.
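To make the "rely on top-k or top-p" advice concrete, here is a hedged sketch of nucleus (top-p) sampling layered on the temperature helper from earlier. The cutoff logic follows the standard nucleus-sampling recipe, not any particular API's implementation:

```python
import numpy as np

def sample_top_p(logits, T=1.0, top_p=0.9, rng=None):
    """Temperature-scale, then sample from the smallest set of top tokens
    whose cumulative probability reaches top_p (nucleus sampling)."""
    rng = rng or np.random.default_rng()
    probs = softmax_with_temperature(logits, T)
    order = np.argsort(probs)[::-1]            # token indices, most to least likely
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                      # the nucleus: just enough tokens to cover top_p
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# Low T and a tight nucleus for extraction; higher T and a wider one for prose.
token = sample_top_p(logits, T=0.7, top_p=0.9)
```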
The interesting thing about temperature is what it isn't. It is not a personality knob, not a politeness dial, not a "make the answer better" control. It is a single scalar that reshapes how much weight the model gives to its own confidence before sampling. Everything that feels like vibes (creativity, caution, weirdness) is a downstream artefact of that one division.
Sources:
- LLM Sampling: Temperature, Top-K, Top-P, and Min-P Explained — Let's Data Science
- How LLMs Choose Their Words: A Practical Walk-Through of Logits, Softmax and Sampling — Machine Learning Mastery
- Temperature - LLM Parameter Guide — Vellum
- Understanding Temperature and Top P in Large Language Models — Lundgren.io