1.d.iii. Feed-Forward Neural Network Models

Feed-forward neural networks were first seriously applied to modeling natural language by Yoshua Bengio in 2003. Bengio used a feed-forward network in which each word in the vocabulary is mapped to an \(m\)-dimensional vector in a continuous vector space.

The paper is light on the details of the mapping \(C : V \to \mathbb R^m\). A later reference fills in some of the details, but still doesn't explicitly explain how \(C\) is constructed. It seems to indicate that \(C\) is learned at the same time as the rest of the network, which would mean that the actual inputs to the model are the word indices.

Each word in the sequence \(w_{i-k:i}\) is then mapped to its corresponding vector, and the \(k\) vectors are concatenated to form the \(k \cdot m\)-dimensional input vector for the neural network.
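If \(C\) is indeed learned jointly with the network, the lookup-and-concatenate step reduces to simple row indexing into a parameter matrix. A minimal NumPy sketch, assuming illustrative sizes for \(v\), \(m\), and \(k\) (the variable names and values are my own, not Bengio's):

```python
import numpy as np

rng = np.random.default_rng(0)

v, m, k = 10_000, 64, 4  # vocabulary size, embedding dim, context length (assumed)

# C is just a v x m matrix of learnable parameters, randomly initialized
# and updated by backpropagation along with the rest of the network.
C = rng.normal(scale=0.01, size=(v, m))

# The actual model inputs are word indices into the vocabulary.
context = np.array([17, 433, 8, 2041])  # hypothetical indices for k words

# Embedding lookup is row indexing; concatenating the k rows yields
# the (k * m)-dimensional input vector to the hidden layer.
x = C[context].reshape(-1)
assert x.shape == (k * m,)
```

Because the lookup is differentiable with respect to the rows of \(C\), gradients flow back into the embedding matrix just as they do into the hidden-layer weights.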

Bengio's feed-forward language model

The softmax output layer shown above is the most computationally expensive part of Bengio's language model: given a vocabulary of size \(v\), one softmax computation requires a matrix-vector multiplication with a matrix of size \(d_\text{hidden} \times v\), followed by \(v\) exponentiations. This makes using large vocabularies prohibitively expensive.
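To make the cost concrete, here is a sketch of a single softmax evaluation over the full vocabulary, with assumed sizes for \(d_\text{hidden}\) and \(v\) (all names and values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

d_hidden, v = 256, 10_000  # hidden size and vocabulary size (assumed)

W = rng.normal(scale=0.01, size=(v, d_hidden))  # output projection weights
b = np.zeros(v)                                  # output biases
h = rng.normal(size=d_hidden)                    # hidden activation for one context

# One softmax evaluation: a (v x d_hidden) matrix-vector product,
# then v exponentiations and a normalization.
logits = W @ h + b
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()
```

The dominant term is roughly \(v \cdot d_\text{hidden}\) multiply-adds per predicted word, so the cost grows linearly with vocabulary size, which is exactly why large vocabularies become prohibitive.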