Feed-forward neural networks were first seriously applied to modeling natural language by Yoshua Bengio in 2003. Bengio used a feed-forward network in which each word in the vocabulary is mapped to an \(m\)-dimensional vector in a continuous vector space.
Each word in the sequence \(w_{i-k:i}\) is then mapped to its corresponding vector, and these vectors are concatenated to form the \(k \cdot m\)-dimensional input vector for the neural network.
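To make the lookup-and-concatenate step concrete, here is a minimal sketch in NumPy. The vocabulary size \(v\), context length \(k\), embedding dimension \(m\), and the variable names are illustrative assumptions, not details from Bengio's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

v, k, m = 10_000, 4, 50                   # vocabulary size, context length, embedding dim
C = rng.normal(scale=0.01, size=(v, m))   # shared embedding matrix: one m-dim vector per word

context_ids = np.array([17, 42, 101, 7])  # indices of the k context words

# Look up each word's vector and concatenate them into a single (k * m)-dim input.
x = C[context_ids].reshape(-1)            # shape: (k * m,)
print(x.shape)                            # (200,)
```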
The softmax output layer shown above is the most computationally expensive part of Bengio's language model: given a vocabulary of size \(v\), a single softmax computation requires a matrix-vector multiplication with a \(d_\text{hidden} \times v\) matrix, followed by \(v\) exponentiations. This makes using large vocabularies prohibitively expensive.
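The toy computation below illustrates where that cost comes from, under assumed shapes (\(d_\text{hidden} = 200\), \(v = 10{,}000\)); it is a sketch of a generic softmax output layer, not Bengio's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_hidden, v = 200, 10_000
W_out = rng.normal(scale=0.01, size=(d_hidden, v))   # hidden -> vocabulary logits
b_out = np.zeros(v)

h = rng.normal(size=d_hidden)                        # hidden-layer activations

logits = h @ W_out + b_out                           # O(d_hidden * v) multiply-adds
logits -= logits.max()                               # subtract max for numerical stability
probs = np.exp(logits)                               # v exponentiations
probs /= probs.sum()                                 # normalize to a distribution over the vocabulary

print(probs.shape, probs.sum())                      # (10000,) 1.0
```

Both the matrix-vector product and the exponentiation step scale linearly with \(v\), which is why the softmax dominates the runtime as the vocabulary grows.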