In light of the vanishing gradient problem of RNN models, the Long Short-Term Memory (LSTM) network architecture has been found to be effective. LSTM models are a class of recurrent networks. The figure below shows the basic structure of one layer in a recurrent neural network. Notice that the architecture is now temporal, as each input and output is labeled with a timestep, and that the key difference from the basic layers of a feed-forward network is that the recurrent layer also considers \(y(t-1)\) in addition to the input values \(x(t)\) when producing \(y(t)\).
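To make the recurrence concrete, the following NumPy sketch steps a single recurrent layer over a short input sequence. The tanh activation, the dimensions, and the random parameters are illustrative assumptions only, not a particular implementation.

```python
import numpy as np

def rnn_step(x_t, y_prev, W, b):
    """One step of a simple recurrent layer: y(t) depends on x(t) and y(t-1).

    x_t:    input vector at time t, shape (n_in,)
    y_prev: this layer's own output from the previous step, shape (n_out,)
    W:      weight matrix applied to the concatenation [y(t-1), x(t)]
    b:      bias vector, shape (n_out,)
    """
    z = W @ np.concatenate([y_prev, x_t]) + b
    return np.tanh(z)  # tanh is an illustrative choice of activation

# Illustrative dimensions and randomly initialized parameters.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_out, n_out + n_in))
b = np.zeros(n_out)

y = np.zeros(n_out)                   # y(0): no previous output yet
for x in rng.normal(size=(5, n_in)):  # a sequence of 5 input vectors
    y = rnn_step(x, y, W, b)          # y(t) wraps back around as input at t+1
```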
As the next figure shows, LSTM layers have many of the same characteristics — a single vector \(x(t)\) input, a single vector \(y(t)\) output, and the general recurrent property that internal context wraps back around as input for the next time step.
In an RNN model, the context that wraps around is the output vector \(y(t)\), but in LSTM models, both the output \(y(t)\) and a new, hidden context vector \(c(t)\) wrap around. LSTM models also differ in the addition of several gates that control how much context to remember, when to forget that context, and when to pass on the value of the activation function. As this is a slightly more involved architecture, some explanatory remarks are necessary.
LSTM networks have an internal state called \(c(t)\), and an output \(y(t)\). The internal state is recurrently passed from the output of the layer to its input on the next time step, as is the output \(y(t)\). However, the output is also passed to the next layer in the network.
The internal state can be modified by the network in a manner that is controlled by several different gates. Each gate is a normal feed-forward network layer with a sigmoid activation function. The layer that learns new features is also a normal feed-forward layer, but its activation function is most often the hyperbolic tangent.
Since the sigmoid function outputs values between 0 and 1, and each of the gates is immediately followed by an element-wise multiplication (\(\otimes\)), the gates are aptly named, as they control how much of a signal passes through. The forget gate controls how much of the previous timestep's internal state is remembered, while the input gate controls how much of the new features is added element-wise (\(\oplus\)) to the internal state. Finally, the internal state is passed element-wise through a hyperbolic tangent activation function before being multiplied by the result of the output gate.
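As a small illustration of this gating behavior, with made-up numbers rather than learned parameters, multiplying a signal element-wise by the sigmoid of large negative, zero, and large positive pre-activations blocks, halves, and passes the corresponding elements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate pre-activations: large negative values push the sigmoid
# towards 0 (block the signal), large positive values towards 1 (pass it).
gate = sigmoid(np.array([-6.0, 0.0, 6.0]))  # approx. [0.002, 0.5, 0.998]
signal = np.array([2.0, 2.0, 2.0])

print(gate * signal)  # approx. [0.005, 1.0, 1.995]: blocked, halved, passed
```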
We can formalize this architecture mathematically as
\[\begin{aligned} f_g(t) & = \sigma\left(W_f \cdot [y(t-1), x(t)] + b_f\right) & \text{forget gate} \\ i_g(t) & = \sigma\left(W_i \cdot [y(t-1), x(t)] + b_i\right) & \text{input gate} \\ o_g(t) & = \sigma\left(W_o \cdot [y(t-1), x(t)] + b_o\right) & \text{output gate} \\ g(t) & = \tanh \left(U \cdot [y(t-1), x(t)] + d \right) & \text{new features} \\ c(t) & = \left(f_g(t) \otimes c(t-1)\right) \oplus \left(i_g(t) \otimes g(t)\right) & \text{internal state} \\ y(t) & = o_g(t) \otimes \tanh\left(c(t)\right) & \text{output} \end{aligned}\]
where \([y(t-1), x(t)]\) is the concatenation of the \(y(t-1)\) and \(x(t)\) vectors, \(\otimes\) and \(\oplus\) are element-wise multiplication and addition respectively, \(W_f, W_i, W_o\), and \(U\) are weight matrices, and \(b_f, b_i, b_o\), and \(d\) are bias vectors.
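The following NumPy sketch implements one step of these equations directly; the dimensions and the randomly initialized parameters are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, U, d):
    """One LSTM step, following the equations above."""
    v = np.concatenate([y_prev, x_t])  # [y(t-1), x(t)]
    f_g = sigmoid(W_f @ v + b_f)       # forget gate
    i_g = sigmoid(W_i @ v + b_i)       # input gate
    o_g = sigmoid(W_o @ v + b_o)       # output gate
    g = np.tanh(U @ v + d)             # new features
    c_t = f_g * c_prev + i_g * g       # internal state
    y_t = o_g * np.tanh(c_t)           # output
    return y_t, c_t

# Illustrative dimensions and random parameters.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
shape = (n_out, n_out + n_in)
W_f, W_i, W_o, U = (rng.normal(size=shape) for _ in range(4))
b_f, b_i, b_o, d = (np.zeros(n_out) for _ in range(4))

y, c = np.zeros(n_out), np.zeros(n_out)  # y(0) and c(0)
for x in rng.normal(size=(5, n_in)):
    y, c = lstm_step(x, y, c, W_f, b_f, W_i, b_i, W_o, b_o, U, d)
```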
The application of LSTM networks to modeling natural language is a relatively recent development. The figure below shows the proposed neural network topology.
The input vector \(\vec x\) is one-hot encoded, which means that, given a vocabulary of size \(n\), \(\vec x\) is \(n\)-dimensional and zero everywhere except at the index corresponding to the word it represents, where its value is one. This is a common word representation for neural network models, and in implementations of the network training and computation, the supporting linear algebra and optimization libraries are free to exploit the sparse nature of the input vector.
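A minimal sketch of this encoding, assuming a toy five-word vocabulary and an arbitrary word-to-index mapping:

```python
import numpy as np

# A toy vocabulary; the word-to-index mapping is purely illustrative.
vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, n=len(vocab)):
    """n-dimensional vector that is zero everywhere except at the word's index."""
    x = np.zeros(n)
    x[index[word]] = 1.0
    return x

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```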
The first hidden layer, in effect, projects the one-hot encoded input vector into a continuous vector space. This continuous word representation is then passed through the LSTM layers, which keep track of the context between the successive \(\vec x\) vectors (each representing a word in the vocabulary) fed to the network. Finally, note that the softmax output layer produces a probability value for each word in the vocabulary, making it the dominant computational factor in training the network.
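The following sketch traces one forward pass through such a topology in NumPy: a one-hot vector is projected by the first hidden layer, passed through a single LSTM layer as in the equations above, and mapped to a vocabulary-sized probability vector by the softmax output layer. All sizes and parameter values are illustrative assumptions, and training is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift by the maximum for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_vocab, n_embed, n_hidden = 5, 8, 16  # illustrative sizes

E = rng.normal(size=(n_embed, n_vocab)) * 0.1  # projection (first hidden) layer
# One weight matrix and bias vector per LSTM block, as in the equations above.
W_f, W_i, W_o, U = (rng.normal(size=(n_hidden, n_hidden + n_embed)) * 0.1
                    for _ in range(4))
b_f, b_i, b_o, d = (np.zeros(n_hidden) for _ in range(4))
V = rng.normal(size=(n_vocab, n_hidden)) * 0.1  # softmax output layer
b_v = np.zeros(n_vocab)

y, c = np.zeros(n_hidden), np.zeros(n_hidden)
for word_index in [0, 3, 1]:       # a short sequence of word indices
    x = np.zeros(n_vocab)          # one-hot input vector
    x[word_index] = 1.0
    e = E @ x                      # continuous word representation
    v = np.concatenate([y, e])     # [y(t-1), x(t)] with x(t) projected
    f_g = sigmoid(W_f @ v + b_f)
    i_g = sigmoid(W_i @ v + b_i)
    o_g = sigmoid(W_o @ v + b_o)
    g = np.tanh(U @ v + d)
    c = f_g * c + i_g * g
    y = o_g * np.tanh(c)
    p = softmax(V @ y + b_v)       # one probability per vocabulary word
```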