1.d.i. How do we model natural language?

Natural language in its textual form may be represented as a sequence of tokens. The granularity of these tokens varies from the coarsest word level down to individual characters, with subword morphemes, token metadata tags, and punctuation in between. A Language Model (LM) is a statistical model of natural language that defines a probability distribution over these sequences.

In particular, given a sequence of words \(w_{1:n}\), we wish to estimate the probability \(P(w_{1:n})\). In the general case, we can use the probability chain rule

\[\displaystyle P(w_{1:n}) = P(w_n \mid w_{1:n - 1}) \cdot P(w_{n - 1} \mid w_{1:n - 2}) \cdots P(w_2 \mid w_1) \cdot P(w_1)\]

to confirm our intuition that a language model's understanding of a given word in a sequence relies on the full context, from the current word all the way back to the first word in the sequence. That is, predicting future tokens in a sequence requires understanding not only the present, but also the entire past history of the sequence.
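
As a concrete illustration, for the three-word sequence \(w_{1:3}\) = "the cat sat", the chain rule gives

\[\displaystyle P(w_{1:3}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the cat})\]

so that each word is conditioned on every word that precedes it.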

Of course, as with any probabilistic model, we need a method of scoring the model. In his seminal work on information theory, Shannon modeled the inherent information content of an information source using entropy. In the discrete case, Shannon entropy is defined as

\[\displaystyle\entropy(p) = - \sum_{i=1}^n p(x_i) \log_b p(x_i) \]

where \(p(x_i)\) is the probability of state \(x_i\), and \(\sum_i p(x_i) = 1\). Entropy can be interpreted as the average information content of the modeled information source.
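
As a small worked example, a fair coin with \(p(\text{heads}) = p(\text{tails}) = \frac{1}{2}\) has entropy \(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1\) bit, while a biased coin with \(p(\text{heads}) = 0.9\) has entropy \(-0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47\) bits: the more predictable the source, the less information each outcome carries, and the lower its entropy.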

Then the cross entropy of two distributions \(p\) and \(q\) can be defined in the discrete case as

\[\displaystyle\entropy(p, q) = - \sum_{i=1}^n p(x_i) \log_b q(x_i) \]

Cross entropy can be thought of as a measure of the difference between two probability distributions. It is therefore often used as an objective function in optimization tasks: we want to minimize the difference between the probability distribution of the training set (as an estimate of the true distribution of the language as a whole) and the distribution of the statistical model's output.
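
This intuition can be made precise: in the discrete case the cross entropy decomposes as

\[\displaystyle \entropy(p, q) = \entropy(p) + D_{\mathrm{KL}}(p \parallel q)\]

where \(D_{\mathrm{KL}}(p \parallel q) \ge 0\) is the Kullback–Leibler divergence, which equals zero exactly when \(p = q\). Since \(\entropy(p)\) is fixed by the data, minimizing the cross entropy with respect to \(q\) is equivalent to minimizing the model's divergence from the target distribution.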

Perplexity is defined as

\[\displaystyle\perplexity(p) = b^{- \sum_{i=1}^n p(x_i) \log_b p(x_i)} \]

which is precisely \(b^{\entropy(p)}\), the exponentiation of Shannon entropy! Given this similarity, one suggested reason that perplexity is preferred over cross entropy as a performance metric is that claiming a large reduction in perplexity sounds more impressive than reporting a cross-entropy improvement of a fraction of a bit.
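
A quick calculation shows why perplexity is often read as an effective branching factor: a uniform distribution over \(k\) outcomes has entropy \(\log_b k\), and therefore

\[\displaystyle \perplexity(p) = b^{\log_b k} = k\]

so the source is exactly as unpredictable as a fair \(k\)-sided die.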

However, when scoring a language model, we rarely (if ever) know the true probability distribution of the language a dataset is sampled from. So we modify the equation above to approximate the perplexity of a language model from the training set's empirical distribution \(\tilde p\) and the language model's output distribution \(\model\)

\[\displaystyle\widetilde{\perplexity}(\tilde p, \model) = b^{- \sum_{i=1}^{n} \tilde{p}(w_i) \log_b \model(w_i \mid w_{1:i-1})}\]

where our training dataset consists of \(n\) words \(w_1, \dots, w_n\). Since the empirical distribution weights each word \(w_i\) uniformly, \(\tilde p(w_i) = \frac{1}{n}\), and substituting this (with base \(b = 2\)) yields the standard definition of perplexity

\[\displaystyle\perplexity(\model) = 2^{-\frac{1}{n}\sum_{i=1}^n \log_2 \model(w_i \mid w_{1:i-1})} \]

Notice that an entirely uninformed language model would simply choose the next token in a sequence uniformly at random. Such a model would assign each \(\model(w_i \mid w_{1:i-1}) = \frac{1}{|V|}\), where \(|V|\) is the vocabulary size, which results in a perplexity equal to the full vocabulary size. A good language model, one that is reflective of "real" language, will assign high probabilities to the observed events in the test corpus, which minimizes the perplexity score. Note, however, that perplexities are corpus specific, which makes it impossible to meaningfully compare the scores of two language models evaluated on different corpora.
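
To make the final formula concrete, the following is a minimal sketch of computing perplexity from a model's per-token conditional probabilities. It assumes we already have the values \(\model(w_i \mid w_{1:i-1})\) for each token in the corpus; the function and variable names are illustrative, not drawn from any particular library.

```python
import math

def perplexity(token_probs, base=2):
    """Perplexity of a corpus, given the model's conditional probability
    M(w_i | w_{1:i-1}) for each observed token w_i."""
    n = len(token_probs)
    # -(1/n) * sum_i log_base M(w_i | w_{1:i-1})
    avg_neg_log_prob = -sum(math.log(p, base) for p in token_probs) / n
    return base ** avg_neg_log_prob

# A uniform model over a 4-token vocabulary assigns 1/4 to every token,
# so its perplexity equals the vocabulary size.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

# A model that assigns high probability to the observed tokens scores lower.
print(perplexity([0.9, 0.8, 0.95, 0.7]))     # ~1.20
```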