Feed-forward neural networks were first seriously applied to modeling natural language by Yoshua Bengio and his collaborators in 2003. Bengio used a feed-forward network in which each word in the vocabulary is mapped to an \(m\)-dimensional vector in a continuous vector space.

The paper is light on the details of the mapping \(C : V \to \mathbb R^m\). fills in some of the details, but still doesn't explicitly explain how \(C\) is constructed. It seems that \(C\) is *learned* jointly with the rest of the network; in other words, \(C\) is a \(|V| \times m\) parameter matrix, and the actual inputs to the model are word indices used to look up rows of that matrix.
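A minimal sketch of this reading of \(C\) as a trainable lookup table, using NumPy (the vocabulary size, embedding dimension, and initialization scale here are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 10_000, 50                       # vocabulary size and embedding dimension (illustrative)
C = rng.normal(scale=0.1, size=(V, m))  # the lookup table: a |V| x m trainable parameter

word_index = 42      # words enter the model as integer indices, not one-hot vectors
vector = C[word_index]  # C maps an index to its m-dimensional vector (a row lookup)
print(vector.shape)
```

During training, gradients flow back into the rows of `C` just as they do into any other weight matrix, which is what "learned at the same time as the rest of the network" amounts to.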

Each word in the sequence \(w_{i-k:i}\) is then mapped to its corresponding vector, and the vectors are concatenated to form the \(k \cdot m\)-dimensional input to the neural network.
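The lookup-and-concatenate step can be sketched as follows (the dimensions and the embedding matrix here are placeholders, assuming the lookup-table reading of \(C\) above):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, k = 10_000, 50, 4                 # illustrative vocabulary size, embedding dim, context length
C = rng.normal(scale=0.1, size=(V, m))  # assumed embedding lookup table

context = [17, 4, 983, 7]               # indices of the k preceding words (hypothetical)
x = np.concatenate([C[i] for i in context])  # the k*m-dimensional input vector
print(x.shape)
```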

The softmax output layer shown above is the most computationally expensive part of Bengio's language model: given a vocabulary of size \(v\), a single softmax computation requires a matrix-vector multiplication with a \(d_\text{hidden} \times v\) matrix, followed by \(v\) exponentiations. This makes large vocabularies prohibitively expensive.
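A sketch of where that cost comes from, with illustrative sizes (the max-subtraction is a standard numerical-stability trick, not something specific to the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, v = 100, 10_000             # illustrative hidden size and vocabulary size
W = rng.normal(size=(v, d_hidden))    # output weight matrix: one row per vocabulary word
h = rng.normal(size=d_hidden)         # hidden-layer activation

logits = W @ h                        # v * d_hidden multiply-adds: one score per word
e = np.exp(logits - logits.max())     # v exponentiations (shifted for numerical stability)
probs = e / e.sum()                   # normalize into a distribution over the vocabulary
print(probs.shape)
```

Both the matrix-vector product and the exponentiations scale linearly with \(v\), so doubling the vocabulary doubles the work per prediction, at every training step and at every inference step.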