1.d.iv. Recurrent Neural Network Models

For several years, Bengio's feed-forward language model, shown in Feed Forward Language Models, was the state-of-the-art neural network language model. Then in 2010, Tomas Mikolov proposed a recurrent neural network (RNN) architecture that substantially advanced model performance.

The use of a recurrent network over the standard feed-forward models has several advantages. Most importantly, recurrent networks can represent arbitrary-length sequences as fixed-size vectors, which matters in modeling natural language, where sentence lengths vary widely. RNNs also abandon the \(k\)-th order Markov assumption that Bengio's FFNN language model relies on: the output probabilities of a recurrent language model can be conditioned on the entire preceding context, instead of a fixed window of \(k\) preceding tokens.
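Written as a factorization of the sequence probability (a standard chain-rule formulation, not taken from either paper), the contrast is:

\[
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \;\approx\; \prod_{t=1}^{T} P(w_t \mid w_{t-k}, \dots, w_{t-1}),
\]

where the approximation on the right is the \(k\)-th order Markov assumption made by the feed-forward model and dropped by the recurrent model.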

Recurrent architectures are able to do this by inputting a single word at a time to the network, rather than a window of \(k\) words as in Feed Forward Language Models. The network stores an internal context vector that summarizes the words it has already seen, and at each step it takes the previous context, together with the next word, as its input.
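In symbols (using the notation of the figure below, where \(x(t)\) is the encoding of the current word and \(C(t)\) the stored context), each step folds the new word into the running context:

\[
C(t) = f\big(x(t),\, C(t-1)\big),
\]

so that \(C(t)\) is a fixed-size summary of every word seen up to time \(t\), and the prediction of the next word can depend on the whole prefix through \(C(t)\).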

Mikolov's recurrent language model

Mikolov's architecture is shown above. The word \(W(t)\) is converted to a one-hot encoding \(x(t)\), which is then concatenated with the context \(C(t-1)\) from the previous time step and used as the input to a hidden context layer. The context layer is fed both to the next iteration's input and to the current iteration's softmax output layer, which produces a probability distribution for the next word \(W(t+1)\). Mikolov notes that the size of the hidden context layer should be proportional to the size of the data set, which results in training times that grow more than linearly with the size of the training set. Another problem with RNN language models is that their gradients tend to vanish or explode during training.
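To make the data flow concrete, here is a minimal sketch of one forward step of such a network in Python with NumPy. The layer sizes, weight names (W_in, W_rec, W_out), and the specific choice of sigmoid and softmax nonlinearities are illustrative assumptions based on the description above, not Mikolov's reference implementation; feeding the one-hot input and the previous context through separate weight matrices is equivalent to concatenating them into a single input vector.

```python
import numpy as np

V = 10   # vocabulary size (toy value, assumed for illustration)
H = 16   # size of the hidden context layer (assumed)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(H, V))    # weights applied to the one-hot input x(t)
W_rec = rng.normal(scale=0.1, size=(H, H))   # weights applied to the previous context C(t-1)
W_out = rng.normal(scale=0.1, size=(V, H))   # weights from the context layer to the output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, c_prev):
    """One time step: consume word W(t), return P(W(t+1)) and the new context C(t)."""
    x = np.zeros(V)
    x[word_id] = 1.0                          # one-hot encoding x(t)
    c = sigmoid(W_in @ x + W_rec @ c_prev)    # hidden context layer C(t)
    p_next = softmax(W_out @ c)               # probability distribution over the next word
    return p_next, c

# Feed a toy "sentence" of word ids one word at a time, carrying the context forward,
# so each prediction depends on every word seen so far through the context vector.
context = np.zeros(H)
for w in [3, 1, 4, 1, 5]:
    p_next, context = step(w, context)
```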