A simple approach to language modeling, useful for motivating the rest of our discussion, is to use a Markov chain to model the probability that one word follows another in a given sequence. A key property of Markov models is that they are stateless, or memoryless: the probability of a transition to a future state depends only on the current state, not on the history of previous states.
\[P(w_{i + 1} \mid w_{1:i}) \approx P(w_{i+1} \mid w_i)\]
This means that, as stated, Markov chain models are poorly suited to modeling textual data: they discard the large amount of context needed to understand natural language.
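For example (the sentence here is invented purely for illustration), a first-order Markov model would predict the next word of "the cat sat on the" using only its final word,

\[P(\text{mat} \mid \text{the cat sat on the}) \approx P(\text{mat} \mid \text{the})\]

discarding the fact that the preceding words describe a cat sitting on something.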
Traditional approaches to building language models relax the requirement of retaining the full history of a sequence, but not as far as the Markov chain model presented in the equation above. These approaches assume the \(k\)th order Markov property: the next word in a sequence depends only on the last \(k\) words of the sequence rather than the full \(n\).
\[P(w_{i+1} \mid w_{1:i}) \approx P(w_{i+1} \mid w_{i-k:i})\]
Under this assumption, we can estimate a sequence's probability as
\[P(w_{1:n}) \approx \prod_{i=1}^n P(w_i \mid w_{i-k:i-1})\]
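To make the factorization concrete, consider an invented four-word sequence with \(k = 2\):

\[P(\text{the cat sat down}) \approx P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the cat}) \cdot P(\text{down} \mid \text{cat sat})\]

The first factors condition on contexts that extend past the start of the sequence; in practice these are handled by padding the sequence with special start-of-sequence symbols.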
One method of producing this estimate is to use the maximum likelihood estimate (MLE)
\[\hat p (w_{i+1} \mid w_{i-k:i}) = \frac{\countf{w_{i-k:i+1}}}{\countf{w_{i-k:i}}}\]
for each subsequence \(w_{i-k:i}\) in the corpus.
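As a minimal sketch of this estimator (the function name, whitespace tokenization, and toy corpus below are illustrative choices of this example, not part of the text), the counts can be accumulated in a single pass over the corpus:

```python
from collections import Counter

def mle_estimates(tokens, k):
    """Estimate p(w_{i+1} | w_{i-k:i}) by maximum likelihood from raw counts.

    Illustrative helper; the context convention w_{i-k:i} follows the equations above.
    """
    context_counts = Counter()  # count(w_{i-k:i})
    ngram_counts = Counter()    # count(w_{i-k:i+1})
    for i in range(len(tokens) - k - 1):
        context = tuple(tokens[i : i + k + 1])           # the k + 1 context words
        ngram_counts[context + (tokens[i + k + 1],)] += 1
        context_counts[context] += 1

    def p_hat(next_word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # unseen context: the zero-probability problem discussed below
        return ngram_counts[context + (next_word,)] / context_counts[context]

    return p_hat

# Toy usage: with k = 1, each context is the two preceding words.
p = mle_estimates("the cat sat on the mat . the cat sat down .".split(), k=1)
print(p("sat", ("the", "cat")))  # 1.0 -- "the cat" is always followed by "sat" here
print(p("on", ("cat", "sat")))   # 0.5 -- "cat sat" is followed by "on" once and "down" once
```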
However, one limitation of this approach is its lack of creative capacity. That is, if a subsequence \(w_{i-k:i+1}\) was never observed in the corpus, then its estimated probability is zero.
At first, this does not seem to pose a problem. If a sequence was not observed in a training corpus, we ought not expect a language model to understand that sequence. However, due to the compositional nature of natural language, it is likely that there are many more sequences that make sense than there are sequences in the training corpus. Thus, if we want an understanding of natural language as a whole, we must be able to extrapolate meaning from sequences we have never seen before.
There are several approaches to avoiding these zero events. One family of approaches is called smoothing, where every possible sequence is given some small probability mass. One example is additive smoothing, where zero probabilities are avoided by assuming that each event occurs \(\alpha\) times in addition to its observed occurrences in the corpus. The MLE estimate is modified as
\[\hat p (w_{i+1} \mid w_{i-k:i}) = \frac{\countf{w_{i-k:i+1}} + \alpha}{\countf{w_{i-k:i}} + \alpha v}\]
where \(v\) is the size of the token vocabulary and \(0 < \alpha \leq 1\). Another family of approaches uses back-off: if the sequence \(w_{i-k:i}\) was not observed, the model falls back to the shorter context \(w_{i-k+1:i}\).
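A corresponding sketch of the additive-smoothing adjustment, under the simplifying assumption that the vocabulary is just the set of distinct tokens in the corpus (again, the names and toy corpus are illustrative, not from the text):

```python
from collections import Counter

def additive_smoothing_estimates(tokens, k, alpha=1.0):
    """Estimate p(w_{i+1} | w_{i-k:i}) with add-alpha smoothing over a vocabulary of size v."""
    vocab = set(tokens)          # simplifying assumption: vocabulary = distinct corpus tokens
    v = len(vocab)
    context_counts = Counter()   # count(w_{i-k:i})
    ngram_counts = Counter()     # count(w_{i-k:i+1})
    for i in range(len(tokens) - k - 1):
        context = tuple(tokens[i : i + k + 1])
        ngram_counts[context + (tokens[i + k + 1],)] += 1
        context_counts[context] += 1

    def p_hat(next_word, context):
        context = tuple(context)
        numerator = ngram_counts[context + (next_word,)] + alpha
        denominator = context_counts[context] + alpha * v
        return numerator / denominator

    return p_hat

# An event that was unseen in the toy corpus now has small but nonzero probability.
p = additive_smoothing_estimates("the cat sat on the mat . the cat sat down .".split(), k=1)
print(p("mat", ("the", "cat")))  # 1/9 with alpha = 1 and v = 7, instead of 0 under MLE
```

A back-off scheme would instead check whether the full context \(w_{i-k:i}\) was observed and, if not, recurse on the shorter context; making that recursion a proper probability distribution (as in Katz back-off) requires discounting, which is omitted here.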