Recall that a language model \(\model\) is a statistical model of a sequence of tokens \(\{w_1, w_2, w_3, \dots\}\) where the probability of the token \(w_{k+1}\) is conditioned on all of the preceding tokens \(w_{1:k}\), denoted \(\model\left(w_{k+1} \mid w_{1:k}\right)\). Since the output of a language model is a probability distribution, we sample the distribution generated by \(\model(w_2 \mid w_1)\) to generate \(w_2\) conditioned on \(w_1\). Then we sample the distribution generated by \(\model(w_3 \mid w_{1:2})\) to generate \(w_3\) conditioned on \(\{w_1, w_2\}\), and so on. The method used to sample the probability distribution can vary, and the quality of the generated sequence varies with it.
The sequence of tokens \(\{w_1, w_2, w_3, \dots\}\) can be a sequence of characters, or a sequence of words. Often, we also insert meta-tokens into the sequence, such as punctuation, a start-of-sequence tag (<s>), and an end-of-sequence tag (</s>) so that the language model can condition its predictions based on even more information.
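The token-by-token generation described above can be sketched as a simple loop. This is only an illustration: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the toy "model" looks only at the last token, whereas a real language model would condition on the whole prefix \(w_{1:k}\).

```python
import random

# Hypothetical toy "model": maps the last token of the prefix to a distribution
# over next tokens. A real language model would use the entire prefix.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Stand-in for M(w_{k+1} | w_{1:k}); returns a dict of token -> probability."""
    return TOY_MODEL[prefix[-1]]


def generate(rng, max_len=10):
    """Sample one token at a time until </s> is produced or max_len is reached."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


sequence = generate(random.Random(0))
print(" ".join(sequence))
```

Each new token is appended to the prefix before the next distribution is requested, which is what makes the generation conditional on everything produced so far.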
When sampling the probability distribution generated by \(\model\left(w_{k+1} \mid w_{1:k}\right)\), one strategy is to always pick the token with the highest probability. Another strategy might be to randomly sample the probability distribution. This way the output of the generative network is more diverse and creative.
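As a rough sketch of the two strategies, the snippet below compares always picking the most probable token with drawing a token in proportion to its probability. The three-entry distribution `probs` and the helper names `greedy_pick` and `random_pick` are made up for illustration.

```python
import random


def greedy_pick(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def random_pick(probs, rng):
    """Return an index drawn in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.6, 0.3]  # hypothetical model output for a 3-token vocabulary
rng = random.Random(0)

greedy_draws = [greedy_pick(probs) for _ in range(1000)]
random_draws = [random_pick(probs, rng) for _ in range(1000)]

print("greedy picks:", sorted(set(greedy_draws)))
print("random picks:", sorted(set(random_draws)))
```

Greedy selection returns the same token every time for a fixed distribution, while random sampling also visits the less likely tokens, which is where the extra diversity comes from.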
We can control how random the sampling is by rescaling the softmax distribution with a temperature \(T > 0\). The standard softmax
\[\softmax(\vec x) = \frac{\exp\left(\vec x\right)}{\sum\exp\left(\vec x\right)}\]
becomes
\[\softmax(\vec x, T) = \frac{\exp\left(\vec x \middle/ T \right)}{\sum\exp\left(\vec x \middle/ T\right)}\]
This temperature transformation can occur during the training of the network, where the normal softmax activation layer is replaced with the above transformation. We can also apply the temperature transformation after training during the generation phase by piping the softmax output values through the temperature transformation
\[\operatorname{sample}(\vec y, T) = \frac{\exp\left(\log(\vec y) \middle/ T\right)}{\sum\exp\left(\log(\vec y) \middle/ T\right)}\]
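after \(\softmax\) has been applied to the logits \(\vec x\). Note that \(\operatorname{sample}(\softmax(\vec x), T)\) and \(\softmax(\vec x, T)\) are equivalent: \(\log(\softmax(\vec x))\) differs from \(\vec x\) only by an additive constant, and that constant cancels between the numerator and the denominator.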
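A minimal numerical check of this equivalence, using only the standard library. The logits and the temperature are arbitrary illustrative values, and `reweight` is a hypothetical name for the \(\operatorname{sample}(\vec y, T)\) transformation; it assumes every probability is strictly positive so that the logarithm is defined.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature=1.0 gives the standard softmax."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(probs, temperature):
    """sample(y, T): apply the temperature to already-normalized probabilities."""
    return softmax([math.log(p) for p in probs], temperature)


logits = [2.0, 1.0, 0.1]  # arbitrary example logits
T = 0.5

direct = softmax(logits, T)               # temperature inside the softmax
post_hoc = reweight(softmax(logits), T)   # temperature applied afterwards

print(direct)
print(post_hoc)
```

The two lists agree up to floating-point rounding, so the temperature can be applied at generation time without touching the trained network.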
Lower temperature values produce a sharper, more peaked distribution, while higher temperatures smooth out the probability distribution. In the limit \(T \to 0\) sampling reduces to always picking the most probable token, and as \(T \to \infty\) the distribution approaches uniform. This means that text generated with a lower temperature is more confident in its choices, but that it is also more conservative than text generated with a high temperature. Likewise, text generated with a high temperature is more diverse because any peaks in the probability distribution get smoothed out. However, the added creativity and diversity that a higher temperature provides comes at the risk of generating nonsense.
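To make the sharpening and smoothing concrete, the sketch below evaluates the temperature softmax on one set of arbitrary example logits at three temperatures and reports the entropy of each result as a rough measure of how spread out the distribution is. The logits, the temperatures, and the helper names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; larger means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # arbitrary example logits
results = {T: softmax(logits, T) for T in (0.5, 1.0, 2.0)}

for T, probs in results.items():
    rounded = [round(p, 3) for p in probs]
    print(f"T={T}: probs={rounded}, entropy={entropy(probs):.3f}")
```

The largest probability shrinks and the entropy grows as the temperature rises, matching the description of high temperatures flattening the peaks.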
See Temperature Sampling for an empirical treatment of temperature sampling, with plots showing the effect of temperature.