Recall from Text Generation that a softmax layer is defined as
\[\softmax(\vec x) = \frac{\exp\left(\vec x\right)}{\sum\exp\left(\vec x\right)}\]
Also recall that you can apply a temperature \(T\) when sampling from a softmax layer by using \(\softmax(\vec x, T)\), defined as
\[\softmax(\vec x, T) = \frac{\exp\left(\vec x \middle/ T \right)}{\sum\exp\left(\vec x \middle/ T\right)}\]
However, as written, this is a part of the network architecture, something that's baked into the neural network during training time. Every single implementation I've seen of temperature sampling performs the sampling on the regular softmax values output by the network after training is completed. These implementations are of the form
\[\operatorname{sample}(\vec y, T) = \frac{\exp\left(\log(\vec y) \middle/ T\right)}{\sum\exp\left(\log(\vec y) \middle/ T\right)}\]
Text Generation claims that \(\operatorname{sample}(\softmax(\vec x), T)\) and \(\softmax(\vec x, T)\) are equivalent. You can do the math to verify this, but it's not very fun. This is an empirical verification of the ugly mathematics that I wrote down in a notebook I can no longer find. Lesson learned.
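One way the algebra can go (a sketch, not necessarily the derivation from the lost notebook; the normalizer \(Z\) and the component index \(i\) are notation introduced here): write \(Z = \sum_j \exp(x_j)\), so that \(y_i = \softmax(\vec x)_i = \exp(x_i) / Z\) and \(\log(y_i) = x_i - \log Z\). Then
\[\begin{aligned}
\exp\left(\log(y_i) \middle/ T\right) &= \exp\left(x_i \middle/ T\right) \, Z^{-1/T} \\
\operatorname{sample}(\vec y, T)_i &= \frac{\exp\left(x_i \middle/ T\right) \, Z^{-1/T}}{\sum_j \exp\left(x_j \middle/ T\right) \, Z^{-1/T}} = \frac{\exp\left(x_i \middle/ T\right)}{\sum_j \exp\left(x_j \middle/ T\right)} = \softmax(\vec x, T)_i
\end{aligned}\]
The factor \(Z^{-1/T}\) does not depend on \(i\), so it cancels between the numerator and the denominator.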
The first thing we do is define \(\softmax\) and \(\operatorname{sample}\) as regular Python functions that operate on NumPy arrays.
import numpy as np
import matplotlib.pyplot as plt
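import scipy as sp
import scipy.stats  # ensure sp.stats is loaded; older SciPy versions do not import it automatically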
import seaborn as sns
sns.set(style="whitegrid")
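def softmax(values, temperature=1.0):
    """Softmax of raw network outputs, scaled by the temperature."""
    preds = np.exp(values / temperature)
    return preds / np.sum(preds)

def sample(values, temperature=1.0):
    """Apply a temperature to values that are already softmax outputs."""
    preds = np.exp(np.log(values) / temperature)
    return preds / np.sum(preds)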
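Before plotting anything, the claim can also be spot-checked numerically. The block below is a minimal sketch under a few assumptions of mine: the names `softmax_t` and `sample_t` are hypothetical stand-ins for the two functions above, the inputs are arbitrary random values rather than real network outputs, and the maximum is subtracted before exponentiating, which leaves the result unchanged but avoids overflow for large inputs.

```python
import numpy as np

def softmax_t(values, temperature=1.0):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = np.asarray(values, dtype=float) / temperature
    preds = np.exp(scaled - np.max(scaled))
    return preds / np.sum(preds)

def sample_t(probs, temperature=1.0):
    """Apply a temperature to probabilities that already came out of a softmax."""
    return softmax_t(np.log(probs), temperature)

# Arbitrary stand-in for network outputs (an assumption, not real data)
rng = np.random.default_rng(0)
logits = rng.normal(size=10)

for temperature in (0.5, 0.8, 1.5):
    direct = softmax_t(logits, temperature)
    rescaled = sample_t(softmax_t(logits), temperature)
    # The two routes should agree up to floating-point error
    assert np.allclose(direct, rescaled)
```

If the assertions pass silently, the two routes agree to within floating-point tolerance for these inputs. That is evidence for the equivalence, not a proof of it.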
Then we define any old probability distribution and add Gaussian noise, so that the plots of \(\operatorname{sample}(\softmax(\vec x), T)\) and \(\softmax(\vec x, T)\) don't land exactly on top of each other and both stay visible. I chose the Argus distribution just to have something other than a regular normal distribution to look at.
x = np.linspace(start=0, stop=1, num=500)
likelihoods = sp.stats.argus.pdf(x, chi=1, loc=0, scale=1)
# Add random noise
likelihoods_1 = likelihoods + np.random.normal(loc=0, scale=0.005, size=len(likelihoods))
likelihoods_2 = likelihoods + np.random.normal(loc=0, scale=0.005, size=len(likelihoods))
# Softmax at T=1, softmax at T=0.8, and the T=1 softmax values re-sampled at T=0.8
softs_1 = softmax(likelihoods_1)
softs_2 = softmax(likelihoods_2, temperature=0.8)
sampled = sample(softs_1, temperature=0.8)
plt.plot(x, softs_1, label=r"$\mathrm{softmax}(\vec x)$")
plt.plot(x, softs_2, label=r"$\mathrm{softmax}(\vec x, t=0.8)$")
plt.plot(x, sampled, label=r"$\mathrm{sample}(\mathrm{softmax}(\vec x), t=0.8)$")
plt.legend()
plt.show()
We can see from the figure below that, apart from the noise, the plots of \(\operatorname{sample}(\softmax(\vec x), T)\) and \(\softmax(\vec x, T)\) are indeed indistinguishable.
Text Generation also discusses the effect of temperature sampling on the resulting probability distribution. A picture is worth a thousand words, so let's make a nice plot showing the effect.
x = np.linspace(start=0, stop=1, num=100)
likelihoods = sp.stats.argus.pdf(x, chi=1, loc=0, scale=1)
likelihoods = likelihoods + np.random.normal(loc=0, scale=0.005, size=len(likelihoods))
for temp in (0.6, 0.8, 1, 1.5, 2):
    plt.plot(x, softmax(likelihoods, temp), label=r"$\mathrm{softmax}(\vec x, t=%.1f)$" % temp)
plt.legend()
plt.show()
High temperatures flatten out the distribution, raising the low probabilities and lowering the high ones. Low temperatures make the peaks more pronounced. In other words, when we temperature sample, we can, in a sense, think of the temperature as the amount of randomness used when sampling. Higher temperatures result in more randomness, which means more "creativity" but less coherency. Lower temperatures result in more coherency (in the extreme case you would just pick the most probable value), at the cost of less creativity.
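To put a rough number on "amount of randomness", one option is to look at the entropy of the tempered distribution and at how often a random draw lands on the most likely item. The sketch below rests on assumptions of mine: the `logits` array is made up for illustration, and `entropy` is a hypothetical helper, not something from Text Generation.

```python
import numpy as np

def softmax(values, temperature=1.0):
    preds = np.exp(values / temperature)
    return preds / np.sum(preds)

def entropy(probs):
    """Shannon entropy in nats; assumes every probability is strictly positive."""
    return -np.sum(probs * np.log(probs))

# Made-up scores for four candidate items (an assumption for illustration)
logits = np.array([2.0, 1.0, 0.5, 0.1])
temperatures = (0.2, 0.6, 1.0, 2.0)
entropies = [entropy(softmax(logits, t)) for t in temperatures]

rng = np.random.default_rng(0)
for t, h in zip(temperatures, entropies):
    # Draw item indices according to the tempered probabilities
    draws = rng.choice(len(logits), size=1000, p=softmax(logits, t))
    top_share = np.mean(draws == np.argmax(logits))
    print(f"T={t}: entropy={h:.3f}, share of draws on the top item={top_share:.2f}")
```

As the temperature rises, the entropy grows toward its maximum of \(\log 4\) for four items, and the share of draws that land on the top item shrinks.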