lmp.model._lstm_1997
LSTM (1997 version) language model.
- class lmp.model._lstm_1997.LSTM1997(*, d_blk: int = 1, d_emb: int = 1, init_ib: float = -1.0, init_lower: float = -0.1, init_ob: float = -1.0, init_upper: float = 0.1, label_smoothing: float = 0.0, n_blk: int = 1, n_lyr: int = 1, p_emb: float = 0.0, p_hid: float = 0.0, tknzr: BaseTknzr, **kwargs: Any)
Bases: BaseModel
LSTM (1997 version) [1] language model.
Let \(x\) be a batch of token ids with batch size \(B\) and per sequence length \(S\).
Let \(V\) be the vocabulary size of the paired tokenizer. Each token id represents a unique token, i.e., \(x_t \in \set{1, \dots, V}\).
Let \(E\) be the token embedding lookup table.
Let \(\dEmb\) be the dimension of token embeddings.
Let \(e_t\) be the token embedding corresponding to token id \(x_t\).
Token embeddings have dropout probability \(\pEmb\).
Let \(\nLyr\) be the number of recurrent layers.
Let \(\dBlk\) be the number of units in a memory cell block.
Let \(\nBlk\) be the number of memory cell blocks.
Let \(\dHid = \nBlk \times \dBlk\).
Let \(h^\ell\) be the hidden states of the \(\ell\) th recurrent layer.
Let \(h_t^\ell\) be the \(t\) th time step of \(h^\ell\).
The initial hidden states \(h_0^\ell\) are given as input.
Hidden states have dropout probability \(\pHid\).
Let \(c^\ell\) be the memory cell internal states of the \(\ell\) th recurrent layer.
Let \(c_t^\ell\) be the \(t\) th time step of \(c^\ell\).
The memory cell initial internal states \(c_0^\ell\) are given as input.
The LSTM (1997 version) language model is defined as follows:
\[\begin{split}\begin{align*}
& \algoProc{\LSTMNineSeven}\pa{x, \pa{\br{c_0^1, \dots, c_0^{\nLyr}}, \br{h_0^1, \dots, h_0^{\nLyr}}}} \\
& \indent{1} \algoFor{t \in \set{1, \dots, S}} \\
& \indent{2} e_t \algoEq (x_t)\text{-th row of } E \text{ but treated as column vector} \\
& \indent{2} \widehat{e_t} \algoEq \drop{e_t}{\pEmb} \\
& \indent{2} h_t^0 \algoEq \tanh\pa{W_h \cdot \widehat{e_t} + b_h} \\
& \indent{1} \algoEndFor \\
& \indent{1} h^0 \algoEq \cat{h_1^0, \dots, h_S^0} \\
& \indent{1} \widehat{h^0} \algoEq \drop{h^0}{\pHid} \\
& \indent{1} \algoFor{\ell \in \set{1, \dots, \nLyr}} \\
& \indent{2} \pa{c^\ell, h^\ell} \algoEq \LSTMNineSevenLayer\pa{ x \algoEq \widehat{h^{\ell-1}}, c_0 \algoEq c_0^\ell, h_0 \algoEq h_0^\ell } \\
& \indent{2} \widehat{h^\ell} \algoEq \drop{h^\ell}{\pHid} \\
& \indent{1} \algoEndFor \\
& \indent{1} \algoFor{t \in \set{1, \dots, S}} \\
& \indent{2} z_t \algoEq \tanh\pa{W_z \cdot h_t^{\nLyr} + b_z} \\
& \indent{2} \widehat{z_t} \algoEq \drop{z_t}{\pHid} \\
& \indent{2} y_t \algoEq \sof{E \cdot \widehat{z_t}} \\
& \indent{1} \algoEndFor \\
& \indent{1} y \algoEq \cat{y_1, \dots, y_S} \\
& \indent{1} \algoReturn \pa{y, \pa{\br{c_S^1, \dots, c_S^{\nLyr}}, \br{h_S^1, \dots, h_S^{\nLyr}}}} \\
& \algoEndProc
\end{align*}\end{split}\]

| Trainable parameter | Shape | Node symbol | Shape |
| --- | --- | --- | --- |
| \(E\) | \((V, \dEmb)\) | \(c^\ell\) | \((B, S, \nBlk, \dBlk)\) |
| \(W_h\) | \((\dHid, \dEmb)\) | \(c_t^\ell\) | \((B, \nBlk, \dBlk)\) |
| \(W_z\) | \((\dEmb, \dHid)\) | \(e_t\) | \((B, \dEmb)\) |
| \(b_h\) | \((\dHid)\) | \(h^\ell\) | \((B, S, \dHid)\) |
| \(b_z\) | \((\dEmb)\) | \(h_t^\ell\) | \((B, \dHid)\) |
| \(\LSTMNineSevenLayer\) | | \(\widehat{h^\ell}\) | \((B, S, \dHid)\) |
| | | \(x\) | \((B, S)\) |
| | | \(x_t\) | \((B)\) |
| | | \(y\) | \((B, S, V)\) |
| | | \(y_t\) | \((B, V)\) |
| | | \(z_t\) | \((B, \dEmb)\) |
| | | \(\widehat{z_t}\) | \((B, \dEmb)\) |
The goal of optimization is to minimize the negative log-likelihood of the next token id \(x_{t+1}\) given \(y_t\). The prediction loss is defined as the average negative log-likelihood over \(x\) given \(y\).
\[\loss = \dfrac{-1}{S} \sum_{t = 1}^S \log \Pr(x_{t+1} \vert y_t).\]
\(z_t\) is obtained by transforming \(h_t^{\nLyr}\) from dimension \(\dHid\) to \(\dEmb\). This is only needed for shape consistency: the hidden states \(h_t^{\nLyr}\) have shape \((B, \dHid)\) and \(E\) has shape \((V, \dEmb)\).
\(y_t\) is the next token id prediction probability distribution over the tokenizer’s vocabulary. We use the inner product to calculate similarity scores over all token ids, and then use softmax to normalize the similarity scores into the probability range \([0, 1]\).
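As a minimal sketch of this last step, the snippet below computes \(y_t\) for a single sample in plain Python, with nested lists standing in for tensors. The helper name `next_token_distribution` and the toy values of \(E\) and \(z_t\) are assumptions made for illustration and are not part of lmp. Dropout is omitted.

```python
import math


def next_token_distribution(E, z_t):
    """Return softmax(E . z_t) for one sample.

    E is a (V, d_emb) nested list and z_t is a list of length d_emb.
    """
    # Inner product of z_t with every row of E gives one score per token id.
    scores = [sum(e_i * z_i for e_i, z_i in zip(row, z_t)) for row in E]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


# Toy lookup table with V = 3 and d_emb = 2 (hypothetical values).
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z_t = [0.5, -0.25]
y_t = next_token_distribution(E, z_t)
print(y_t)
```

Because the same table \(E\) is used for both the lookup and the scoring, the token whose embedding is most aligned with \(z_t\) receives the largest probability.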
The calculations after the hidden states are the same as in ElmanNet.
Model parameters in LSTM (1997 version) are initialized with the uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) of the uniform distribution are given as hyperparameters.
Input gate biases are initialized with the uniform distribution \(\mathcal{U}(\init_{ib}, 0)\). This makes the input gates remain closed at the start of training.
Output gate biases are initialized with the uniform distribution \(\mathcal{U}(\init_{ob}, 0)\). This makes the output gates remain closed at the start of training.
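The effect of a non-positive gate bias can be illustrated with a small sketch, assuming the gate's weighted inputs are close to zero so that the activation is roughly \(\sigma(b)\). The helper `init_gate_biases` and the seed are hypothetical and only mimic the sampling rule stated above. They are not the lmp implementation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_gate_biases(n_blk, init_b, rng):
    """Draw one bias per memory cell block from U(init_b, 0)."""
    return [rng.uniform(init_b, 0.0) for _ in range(n_blk)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
b_i = init_gate_biases(n_blk=8, init_b=-1.0, rng=rng)
# With near-zero weighted inputs, each gate opens by at most sigmoid(0) = 0.5.
gate_openings = [sigmoid(b) for b in b_i]
print(max(gate_openings))
```

A more negative lower bound pushes the gate activations further toward zero at the start of training.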
- Parameters
d_blk (int, default: 1) – Number of units in a memory cell block \(\dBlk\).
d_emb (int, default: 1) – Token embedding dimension \(\dEmb\).
init_ib (float, default: -1.0) – Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_ob (float, default: -1.0) – Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
label_smoothing (float, default: 0.0) – Smoothing applied on prediction target \(x_{t+1}\).
n_blk (int, default: 1) – Number of memory cell blocks \(\nBlk\).
n_lyr (int, default: 1) – Number of recurrent layers \(\nLyr\).
p_emb (float, default: 0.0) – Embeddings dropout probability \(\pEmb\).
p_hid (float, default: 0.0) – Hidden units dropout probability \(\pHid\).
tknzr (BaseTknzr) – Tokenizer instance.
- emb
Token embedding lookup table \(E\). Input shape: \((B, S)\). Output shape: \((B, S, \dEmb)\).
- Type
- fc_e2h
Fully connected layer \(W_h\) and \(b_h\) which connects input units to the 1st recurrent layer’s input. Dropout with probability \(\pEmb\) is applied to input. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dEmb)\). Output shape: \((B, S, \dHid)\).
- Type
- fc_h2e
Fully connected layer \(W_z\) and \(b_z\) which transforms hidden states to next token embeddings. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dEmb)\).
- Type
- init_ib
Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.
- Type
- init_lower
Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
- Type
- init_ob
Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.
- Type
- init_upper
Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
- Type
- stack_rnn
LSTM1997Layer stacking layers. Each LSTM (1997 version) layer is followed by a dropout layer with probability \(\pHid\). Counting the dropout layers, the number of stacked layers is equal to \(2 \nLyr\). Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dHid)\).
- Type
See also
ElmanNet
Elman Net language model.
LSTM1997Layer
LSTM (1997 version) recurrent neural network.
- classmethod add_CLI_args(parser: ArgumentParser) -> None
Add LSTM (1997 version) language model hyperparameters to CLI argument parser.
- Parameters
parser (argparse.ArgumentParser) – CLI argument parser.
- Return type
None
See also
- lmp.script.train_model
Language model training script.
Examples
>>> import argparse
>>> import math
>>> from lmp.model import LSTM1997
>>> parser = argparse.ArgumentParser()
>>> LSTM1997.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--d_blk', '64',
...   '--d_emb', '100',
...   '--init_ib', '-0.1',
...   '--init_lower', '-0.01',
...   '--init_ob', '-0.1',
...   '--init_upper', '0.01',
...   '--label_smoothing', '0.1',
...   '--n_blk', '8',
...   '--n_lyr', '2',
...   '--p_emb', '0.5',
...   '--p_hid', '0.1',
... ])
>>> assert args.d_blk == 64
>>> assert args.d_emb == 100
>>> assert math.isclose(args.init_ib, -0.1)
>>> assert math.isclose(args.init_lower, -0.01)
>>> assert math.isclose(args.init_ob, -0.1)
>>> assert math.isclose(args.init_upper, 0.01)
>>> assert math.isclose(args.label_smoothing, 0.1)
>>> assert args.n_blk == 8
>>> assert args.n_lyr == 2
>>> assert math.isclose(args.p_emb, 0.5)
>>> assert math.isclose(args.p_hid, 0.1)
- cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) -> Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]
Calculate language model prediction loss.
We use cross entropy loss as our training objective. This method is only used for training.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each with shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each with shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.
- Returns
The first item in the tuple is the mini-batch cross-entropy loss. The loss tensor has shape \((1)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from the current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from the current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both lists have the same structure as in batch_prev_states.
- Return type
tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
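The loss returned by cal_loss can be sketched for a single sequence in plain Python. The snippet assumes label smoothing follows the convention of `torch.nn.functional.cross_entropy`, where the target distribution is a mixture of the one-hot target and a uniform distribution over the vocabulary. The function name `smoothed_cross_entropy`, the 0-based token ids, and the toy probabilities are illustrative assumptions, not lmp code.

```python
import math


def smoothed_cross_entropy(probs, targets, label_smoothing=0.0):
    """Average cross-entropy over S time steps for one sequence.

    probs[t] is the predicted distribution y_t over V tokens and
    targets[t] is the next token id x_{t+1} (0-based in this sketch).
    """
    total = 0.0
    for y_t, target in zip(probs, targets):
        V = len(y_t)
        for v, p in enumerate(y_t):
            # Assumed smoothing rule: (1 - eps) on the true id plus eps / V on every id.
            q = label_smoothing / V + (1.0 - label_smoothing) * (v == target)
            total -= q * math.log(p)
    return total / len(targets)


probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # S = 2, V = 3 (toy values)
targets = [0, 1]
plain = smoothed_cross_entropy(probs, targets)
smoothed = smoothed_cross_entropy(probs, targets, label_smoothing=0.1)
print(plain, smoothed)
```

With `label_smoothing=0.0` the result reduces to the average negative log-likelihood of the target token ids.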
- forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) -> Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]
Calculate next token id logits.
Logits are calculated based on previous hidden states, memory cell previous internal states and current input token ids. Use pred to convert logits into a next token id probability distribution over the tokenizer’s vocabulary. Use cal_loss to convert logits into the next token id prediction loss. Below we describe the forward pass algorithm of the LSTM (1997 version) language model.
1. Use token ids to look up token embeddings with self.emb.
2. Use self.fc_e2h to transform token embeddings into the 1st recurrent layer’s input.
3. Feed the transformation result into the recurrent layer and output hidden states. We use teacher forcing in this step during training, i.e., inputs are directly given instead of generated by the model.
4. Feed the output of the previous recurrent layer into the next recurrent layer until all layers have been used once.
5. Use self.fc_h2e to transform the last recurrent layer’s hidden states into next token embeddings.
6. Perform the inner product on token embeddings over the tokenizer’s vocabulary to get similarity scores.
7. Return similarity scores (logits).
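The steps above can be followed by tracking tensor shapes only. The sketch below does not run the model. It lists the shape produced at each stage, taken from the attribute descriptions of emb, fc_e2h, stack_rnn and fc_h2e. The function name `trace_forward_shapes` and the hyperparameter values are hypothetical.

```python
def trace_forward_shapes(B, S, V, d_emb, n_blk, d_blk, n_lyr):
    """List (stage, shape) pairs for the forward pass, without any computation."""
    d_hid = n_blk * d_blk
    stages = [
        ("batch_cur_tkids", (B, S)),
        ("emb", (B, S, d_emb)),
        ("fc_e2h", (B, S, d_hid)),
    ]
    # Every recurrent layer keeps the hidden dimension unchanged.
    for lyr in range(1, n_lyr + 1):
        stages.append((f"recurrent layer {lyr}", (B, S, d_hid)))
    stages.append(("fc_h2e", (B, S, d_emb)))
    # Inner product with the (V, d_emb) embedding table yields one score per token id.
    stages.append(("logits", (B, S, V)))
    return stages


stages = trace_forward_shapes(B=32, S=16, V=1000, d_emb=100, n_blk=8, d_blk=64, n_lyr=2)
for name, shape in stages:
    print(f"{name:>18}: {shape}")
```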
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each with shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each with shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.
- Returns
The first item in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from the current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from the current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both lists have the same structure as in batch_prev_states.
- Return type
tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
- params_init() -> None
Initialize model parameters.
All weights and biases other than input / output gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\). Input gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_{ib}, 0}\). Output gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_{ob}, 0}\).
- Return type
None
See also
params_init
LSTM (1997 version) layer parameter initialization.
- pred(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) -> Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]
Calculate next token id probability distribution over tokenizer’s vocabulary.
Probabilities are calculated based on previous hidden states, memory cell previous internal states and current input token ids. This method is only used for inference. No tensor graphs are constructed and no gradients are calculated.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each with shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each with shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.
- Returns
The first item in the tuple is the batch of next token id probability distributions over the tokenizer’s vocabulary. The probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from the current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from the current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both lists have the same structure as in batch_prev_states.
- Return type
tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
- class lmp.model._lstm_1997.LSTM1997Layer(*, d_blk: int = 1, in_feat: int = 1, init_ib: float = -1.0, init_lower: float = -0.1, init_ob: float = -1.0, init_upper: float = 0.1, n_blk: int = 1, **kwargs: Any)
Bases: Module
LSTM (1997 version) [1] recurrent neural network.
Let \(\hIn\) be the number of input features per time step.
Let \(\dBlk\) be the number of units in a memory cell block.
Let \(\nBlk\) be the number of memory cell blocks.
Let \(\hOut = \nBlk \times \dBlk\) be the number of output features per time step.
Let \(x\) be a batch of sequences of input features with shape \((B, S, \hIn)\), where \(B\) is batch size and \(S\) is per sequence length.
Let \(h_0\) be the initial hidden states with shape \((B, \hOut)\).
Let \(c_0\) be the memory cell initial internal states with shape \((B, \nBlk, \dBlk)\).
The LSTM (1997 version) layer is defined as follows:
\[\begin{split}\begin{align*}
& \algoProc{\LSTMNineSevenLayer}\pa{x, c_0, h_0} \\
& \indent{1} S \algoEq x.\sz{1} \\
& \indent{1} \algoFor{t \in \set{1, \dots, S}} \\
& \indent{2} i_t \algoEq \sigma\pa{W_i \cdot x_t + U_i \cdot h_{t-1} + b_i} \\
& \indent{2} o_t \algoEq \sigma\pa{W_o \cdot x_t + U_o \cdot h_{t-1} + b_o} \\
& \indent{2} \algoFor{k \in \set{1, \dots, \nBlk}} \\
& \indent{3} g_{t,k} \algoEq \tanh\pa{W_k \cdot x_t + U_k \cdot h_{t-1} + b_k} &&\tag{1}\label{1} \\
& \indent{3} c_{t,k} \algoEq c_{t-1,k} + i_{t,k} \cdot g_{t,k} \\
& \indent{3} h_{t,k} \algoEq o_{t,k} \cdot \tanh\pa{c_{t,k}} &&\tag{2}\label{2} \\
& \indent{2} \algoEndFor \\
& \indent{2} c_t \algoEq \cat{c_{t,1}, \dots, c_{t,\nBlk}} \\
& \indent{2} h_t \algoEq \fla{h_{t,1}, \dots, h_{t,\nBlk}} \\
& \indent{1} \algoEndFor \\
& \indent{1} c \algoEq \cat{c_1, \dots, c_S} \\
& \indent{1} h \algoEq \cat{h_1, \dots, h_S} \\
& \indent{1} \algoReturn (c, h) \\
& \algoEndProc
\end{align*}\end{split}\]

| Trainable parameter | Shape | Node symbol | Shape |
| --- | --- | --- | --- |
| \(U_i\) | \((\nBlk, \hOut)\) | \(c\) | \((B, S, \nBlk, \dBlk)\) |
| \(U_k\) | \((\dBlk, \hOut)\) | \(c_t\) | \((B, \nBlk, \dBlk)\) |
| \(U_o\) | \((\nBlk, \hOut)\) | \(c_{t,k}\) | \((B, \dBlk)\) |
| \(W_i\) | \((\nBlk, \hIn)\) | \(g_{t,k}\) | \((B, \dBlk)\) |
| \(W_k\) | \((\dBlk, \hIn)\) | \(h\) | \((B, S, \hOut)\) |
| \(W_o\) | \((\nBlk, \hIn)\) | \(h_t\) | \((B, \hOut)\) |
| \(b_i\) | \((\nBlk)\) | \(h_{t,k}\) | \((B, \dBlk)\) |
| \(b_o\) | \((\nBlk)\) | \(i_t\) | \((B, \nBlk)\) |
| \(b_k\) | \((\dBlk)\) | \(i_{t,k}\) | \((B, 1)\) |
| | | \(o_t\) | \((B, \nBlk)\) |
| | | \(o_{t,k}\) | \((B, 1)\) |
| | | \(x\) | \((B, S, \hIn)\) |
| | | \(x_t\) | \((B, \hIn)\) |
\(i_t\) is the memory cell blocks’ input gate units at time step \(t\). \(i_{t,k}\) is the \(k\)-th coordinate of \(i_t\), which represents the \(k\)-th memory cell block’s input gate unit at time step \(t\).
\(o_t\) is the memory cell blocks’ output gate units at time step \(t\). \(o_{t,k}\) is the \(k\)-th coordinate of \(o_t\), which represents the \(k\)-th memory cell block’s output gate unit at time step \(t\).
The \(k\)-th memory cell block consists of:
Current input features \(x_t\).
Previous hidden states \(h_{t-1}\).
Input activations \(g_{t,k}\).
An input gate unit \(i_{t,k}\).
An output gate unit \(o_{t,k}\).
Memory cell previous internal states \(c_{t-1,k}\) and memory cell current internal states \(c_{t,k}\).
Output units \(h_{t,k}\).
All memory cell current internal states at time step \(t\) are concatenated to form \(c_t\).
All memory cell output units at time step \(t\) are flattened to form \(h_t\).
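A single time step of the layer can be sketched for one sample in plain Python, with nested lists in place of tensors. This is an illustrative sketch rather than the lmp implementation: the function names and the dictionary keys `W_blk`, `U_blk` and `b_blk` (holding \(W_k\), \(U_k\), \(b_k\) for each block) are hypothetical, and the toy parameter values are chosen only to make the arithmetic easy to follow.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W.x + U.h + b for row-major matrices W, U and vectors x, h, b."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm1997_step(x_t, c_prev, h_prev, params):
    """One time step for a single sample; c_prev has shape (n_blk, d_blk)."""
    i_t = [sigmoid(v) for v in affine(params["W_i"], x_t, params["U_i"], h_prev, params["b_i"])]
    o_t = [sigmoid(v) for v in affine(params["W_o"], x_t, params["U_o"], h_prev, params["b_o"])]
    c_t, h_t = [], []
    for k, c_prev_k in enumerate(c_prev):
        W_k, U_k, b_k = params["W_blk"][k], params["U_blk"][k], params["b_blk"][k]
        g_tk = [math.tanh(v) for v in affine(W_k, x_t, U_k, h_prev, b_k)]
        # The internal state accumulates the input activations scaled by the block's input gate.
        c_tk = [c + i_t[k] * g for c, g in zip(c_prev_k, g_tk)]
        c_t.append(c_tk)
        # Block outputs are flattened into one hidden vector of size n_blk * d_blk.
        h_t.extend(o_t[k] * math.tanh(c) for c in c_tk)
    return c_t, h_t


# Toy setting: in_feat = 1, n_blk = 1, d_blk = 2, all weights zero.
params = {
    "W_i": [[0.0]], "U_i": [[0.0, 0.0]], "b_i": [0.0],
    "W_o": [[0.0]], "U_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_blk": [[[0.0], [0.0]]],
    "U_blk": [[[0.0, 0.0], [0.0, 0.0]]],
    "b_blk": [[1.0, -1.0]],
}
c_t, h_t = lstm1997_step(x_t=[1.0], c_prev=[[0.0, 0.0]], h_prev=[0.0, 0.0], params=params)
print(c_t, h_t)
```

With zero weights and zero gate biases both gates equal \(\sigma(0) = 0.5\), so the new internal state is half of the input activation.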
Our implementation uses \(\tanh\) as the memory cell blocks’ input activation function. The implementation in the paper uses \(4 \sigma - 2\) in \(\eqref{1}\) and \(2 \sigma - 1\) in \(\eqref{2}\). We argue that the change in \(\eqref{1}\) does not greatly affect the computation result and \(\eqref{2}\) is almost the same as the paper implementation. To be precise, \(\tanh(x) = 2 \sigma(2x) - 1\). The formula \(2 \sigma(x) - 1\) has gradient \(2 \sigma(x) (1 - \sigma(x))\). The formula \(\tanh(x)\) has gradient \(4 \sigma(2x) (1 - \sigma(2x))\). Intuitively, using \(\tanh\) doubles the input scale, which doubles the gradient at \(x = 0\) (from \(1/2\) to \(1\)).
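The identity and the two gradient formulas quoted above can be checked numerically. The sketch below uses only the Python standard library and a central finite difference. The helper names `numeric_grad` and `paper_squash` are chosen here for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def paper_squash(x):
    """The 2 * sigmoid(x) - 1 squashing function used in the paper."""
    return 2.0 * sigmoid(x) - 1.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1.
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)
    # Closed-form gradients quoted in the text.
    grad_paper = 2.0 * sigmoid(x) * (1.0 - sigmoid(x))
    grad_tanh = 4.0 * sigmoid(2.0 * x) * (1.0 - sigmoid(2.0 * x))
    assert math.isclose(numeric_grad(paper_squash, x), grad_paper, abs_tol=1e-6)
    assert math.isclose(numeric_grad(math.tanh, x), grad_tanh, abs_tol=1e-6)
    print(f"x={x:+.1f}  d/dx(2*sigmoid-1)={grad_paper:.4f}  d/dx(tanh)={grad_tanh:.4f}")
```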
Model parameters in the LSTM (1997 version) layer are initialized with the uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.
Input gate biases are initialized with the uniform distribution \(\mathcal{U}(\init_{ib}, 0)\). The lower bound \(\init_{ib}\) is given as a hyperparameter. This makes the input gates remain closed at the start of training.
Output gate biases are initialized with the uniform distribution \(\mathcal{U}(\init_{ob}, 0)\). The lower bound \(\init_{ob}\) is given as a hyperparameter. This makes the output gates remain closed at the start of training.
- Parameters
d_blk (int, default: 1) – Dimension of each memory cell block \(\dBlk\).
in_feat (int, default: 1) – Number of input features per time step \(\hIn\).
init_ib (float, default: -1.0) – Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_ob (float, default: -1.0) – Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
n_blk (int, default: 1) – Number of memory cell blocks \(\nBlk\).
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
- c_0
Memory cell blocks’ initial internal states \(c_0\). Shape: \((1, \nBlk, \dBlk)\).
- Type
- fc_h2ig
Fully connected layer \(U_i\) which connects hidden states to memory cell’s input gate units. Input shape: \((B, \dHid)\). Output shape: \((B, \nBlk)\).
- Type
- fc_h2mc_in
Fully connected layers \(\pa{U_1, \dots, U_{\nBlk}}\) which connect hidden states to memory cell blocks’ input activations. Input shape: \((B, \dHid)\). Output shape: \((B, \dHid)\).
- Type
- fc_h2og
Fully connected layer \(U_o\) which connects hidden states to memory cell’s output gate units. Input shape: \((B, \dHid)\). Output shape: \((B, \nBlk)\).
- Type
- fc_x2ig
Fully connected layer \(W_i\) and \(b_i\) which connects input units to memory cell’s input gate units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \nBlk)\).
- Type
- fc_x2mc_in
Fully connected layers \(\pa{W_1, \dots, W_{\nBlk}}\) and \(\pa{b_1, \dots, b_{\nBlk}}\) which connect input units to memory cell blocks’ input activations. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \dHid)\).
- Type
- fc_x2og
Fully connected layer \(W_o\) and \(b_o\) which connects input units to memory cell’s output gate units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \nBlk)\).
- Type
- h_0
Initial hidden states \(h_0\). Shape: \((1, \dHid)\).
- Type
- init_ib
Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.
- Type
- init_lower
Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
- Type
- init_ob
Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.
- Type
- init_upper
Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
- Type
- forward(x: Tensor, c_0: Optional[Tensor] = None, h_0: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]
Calculate the batch of hidden states for x.
Below we describe the forward pass algorithm of the LSTM (1997 version) layer.
1. Let x be a batch of sequences of input features \(x\).
2. Let x.size(1) be the sequence length \(S\).
3. Let c_0 be the memory cell initial internal states \(c_0\). If c_0 is None, use self.c_0 instead.
4. Let h_0 be the initial hidden states \(h_0\). If h_0 is None, use self.h_0 instead.
5. Loop through \(\set{1, \dots, S}\) with looping index \(t\).
   1. Use \(x_t\), \(h_{t-1}\), self.fc_x2ig and self.fc_h2ig to get input gate units \(i_t\).
   2. Use \(x_t\), \(h_{t-1}\), self.fc_x2og and self.fc_h2og to get output gate units \(o_t\).
   3. Use \(x_t\), \(h_{t-1}\), self.fc_x2mc_in and self.fc_h2mc_in to get memory cell input activations \(g_{t,1}, \dots, g_{t,\nBlk}\).
   4. Derive memory cell new internal states \(c_{t,1}, \dots, c_{t,\nBlk}\) using input gate units \(i_{t,1}, \dots, i_{t,\nBlk}\), memory cell old internal states \(c_{t-1,1}, \dots, c_{t-1,\nBlk}\) and memory cell input activations \(g_{t,1}, \dots, g_{t,\nBlk}\).
   5. Derive new hidden states \(h_t\) using output gate units \(o_{t,1}, \dots, o_{t,\nBlk}\) and memory cell new internal states \(c_{t,1}, \dots, c_{t,\nBlk}\).
6. Denote the concatenation of memory cell internal states \(c_1, \dots, c_S\) as \(c\).
7. Denote the concatenation of hidden states \(h_1, \dots, h_S\) as \(h\).
8. Return \((c, h)\).
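The control flow of these steps (falling back to stored initial states, looping over time steps, and collecting every state) can be sketched independently of the gate arithmetic. In the snippet below, `unroll`, its `default_c_0` / `default_h_0` arguments and `toy_step` are hypothetical stand-ins: `toy_step` is a deliberately simple scalar recurrence, not the LSTM update.

```python
def unroll(xs, step, c_0=None, h_0=None, default_c_0=None, default_h_0=None):
    """Run `step` over a sequence and collect every internal and hidden state."""
    # Fall back to the stored initial states when none are given.
    c_prev = default_c_0 if c_0 is None else c_0
    h_prev = default_h_0 if h_0 is None else h_0
    cs, hs = [], []
    for x_t in xs:
        c_prev, h_prev = step(x_t, c_prev, h_prev)
        cs.append(c_prev)
        hs.append(h_prev)
    return cs, hs


def toy_step(x_t, c_prev, h_prev):
    # Hypothetical stand-in: accumulate the input and expose half of the state.
    c_t = c_prev + x_t
    return c_t, 0.5 * c_t


cs, hs = unroll([1.0, 2.0, 3.0], toy_step, default_c_0=0.0, default_h_0=0.0)
print(cs, hs)
```

Passing the last entries of `cs` and `hs` as `c_0` and `h_0` of a following call continues the recurrence where the previous sequence stopped.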
- Parameters
x (torch.Tensor) – Batch of sequences of input features. x has shape \((B, S, \hIn)\) and dtype == torch.float.
c_0 (Optional[torch.Tensor], default: None) – Batch of memory cell previous internal states. The tensor has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. Set to None to use the initial memory cell internal states self.c_0.
h_0 (Optional[torch.Tensor], default: None) – Batch of previous hidden states. The tensor has shape \((B, \hOut)\) and dtype == torch.float. Set to None to use the initial hidden states self.h_0.
- Returns
The first tensor is the batch of memory cell internal states and the second tensor is the batch of hidden states. The batch of memory cell internal states has shape \((B, S, \nBlk, \dBlk)\) and dtype == torch.float. The batch of hidden states has shape \((B, S, \hOut)\) and dtype == torch.float.
- Return type
- params_init() -> None
Initialize model parameters.
All weights and biases other than \(b_i, b_o\) are initialized with the uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\). \(b_i\) is initialized with the uniform distribution \(\mathcal{U}\pa{\init_{ib}, 0}\). \(b_o\) is initialized with the uniform distribution \(\mathcal{U}\pa{\init_{ob}, 0}\). \(b_i, b_o\) are initialized separately so that the input and output gates remain closed at the start of training.
- Return type
None
[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. URL: https://ieeexplore.ieee.org/abstract/document/6795963, doi:10.1162/neco.1997.9.8.1735.