LSTM (2002 version)#

class lmp.model.LSTM2002(*, d_blk: int = 1, d_emb: int = 1, init_fb: float = 1.0, init_ib: float = -1.0, init_lower: float = -0.1, init_ob: float = -1.0, init_upper: float = 0.1, label_smoothing: float = 0.0, n_blk: int = 1, n_lyr: int = 1, p_emb: float = 0.0, p_hid: float = 0.0, tknzr: BaseTknzr, **kwargs: Any)[source]#

Bases: LSTM2000

LSTM (2002 version) [1] language model.

  • Let \(x\) be a batch of token ids with batch size \(B\) and sequence length \(S\).

  • Let \(V\) be the vocabulary size of the paired tokenizer. Each token id represents a unique token, i.e., \(x_t \in \set{1, \dots, V}\).

  • Let \(E\) be the token embedding lookup table.

    • Let \(\dEmb\) be the dimension of token embeddings.

    • Let \(e_t\) be the token embedding corresponding to token id \(x_t\).

    • Token embeddings have dropout probability \(\pEmb\).

  • Let \(\nLyr\) be the number of recurrent layers.

  • Let \(\dBlk\) be the number of units in a memory cell block.

  • Let \(\nBlk\) be the number of memory cell blocks.

  • Let \(\dHid = \nBlk \times \dBlk\).

  • Let \(h^\ell\) be the hidden states of the \(\ell\) th recurrent layer.

    • Let \(h_t^\ell\) be the \(t\) th time step of \(h^\ell\).

    • The initial hidden states \(h_0^\ell\) are given as input.

    • Hidden states have dropout probability \(\pHid\).

  • Let \(c^\ell\) be the memory cell internal states of the \(\ell\) th recurrent layer.

    • Let \(c_t^\ell\) be the \(t\) th time step of \(c^\ell\).

    • The memory cell initial internal states \(c_0^\ell\) are given as input.

LSTM (2002 version) language model is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\LSTMZeroTwo}\pa{x, \pa{\br{c_0^1, \dots, c_0^{\nLyr}}, \br{h_0^1, \dots, h_0^{\nLyr}}}} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} e_t \algoEq (x_t)\text{-th row of } E \text{ but treated as column vector} \\ & \indent{2} \widehat{e_t} \algoEq \drop{e_t}{\pEmb} \\ & \indent{2} h_t^0 \algoEq \tanh\pa{W_h \cdot \widehat{e_t} + b_h} \\ & \indent{1} \algoEndFor \\ & \indent{1} h^0 \algoEq \cat{h_1^0, \dots, h_S^0} \\ & \indent{1} \widehat{h^0} \algoEq \drop{h^0}{\pHid} \\ & \indent{1} \algoFor{\ell \in \set{1, \dots, \nLyr}} \\ & \indent{2} \pa{c^\ell, h^\ell} \algoEq \LSTMZeroTwoLayer\pa{ x \algoEq \widehat{h^{\ell-1}}, c_0 \algoEq c_0^\ell, h_0 \algoEq h_0^\ell } \\ & \indent{2} \widehat{h^\ell} \algoEq \drop{h^\ell}{\pHid} \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} z_t \algoEq \tanh\pa{W_z \cdot h_t^{\nLyr} + b_z} \\ & \indent{2} \widehat{z_t} \algoEq \drop{z_t}{\pHid} \\ & \indent{2} y_t \algoEq \sof{E \cdot \widehat{z_t}} \\ & \indent{1} \algoEndFor \\ & \indent{1} y \algoEq \cat{y_1, \dots, y_S} \\ & \indent{1} \algoReturn \pa{y, \pa{\br{c_S^1, \dots, c_S^{\nLyr}}, \br{h_S^1, \dots, h_S^{\nLyr}}}} \\ & \algoEndProc \end{align*}\end{split}\]
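The procedure above can be summarized by the following minimal PyTorch sketch, written against the attributes documented below (self.emb, self.fc_e2h, self.stack_rnn, self.fc_h2e). It is illustrative only: the assumption that self.stack_rnn alternates LSTM2002Layer and dropout modules mirrors the stack_rnn description, and the sketch returns logits (pre-softmax scores), as forward does, rather than the probabilities \(y\).

import torch

def lstm2002_forward_sketch(model, x, c_0_list, h_0_list):
  # x: (B, S) token ids.  c_0_list / h_0_list: per-layer initial states.
  e = model.emb(x)                                    # (B, S, d_emb) token embeddings.
  h = model.fc_e2h(e)                                 # (B, S, d_hid): dropout + tanh(W_h e + b_h) + dropout.
  c_last, h_last = [], []
  for ell, layer in enumerate(model.stack_rnn[::2]):  # assumed layout: layer, dropout, layer, dropout, ...
    c, h_full = layer(x=h, c_0=c_0_list[ell], h_0=h_0_list[ell])
    h = model.stack_rnn[2 * ell + 1](h_full)          # dropout with probability p_hid.
    c_last.append(c[:, -1])                           # (B, n_blk, d_blk) last internal states.
    h_last.append(h_full[:, -1])                      # (B, d_hid) last hidden states.
  z = model.fc_h2e(h)                                 # (B, S, d_emb): tanh(W_z h + b_z) + dropout.
  logits = z @ model.emb.weight.transpose(0, 1)       # (B, S, V): inner product with E.
  return logits, (c_last, h_last)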

Trainable Parameters and Nodes

The left two columns list trainable parameters; the right two list the tensors (nodes) appearing in the procedure above.

| Parameter | Shape | Symbol | Shape |
| --- | --- | --- | --- |
| \(E\) | \((V, \dEmb)\) | \(c^\ell\) | \((B, S, \nBlk, \dBlk)\) |
| \(W_h\) | \((\dHid, \dEmb)\) | \(c_t^\ell\) | \((B, \nBlk, \dBlk)\) |
| \(W_z\) | \((\dEmb, \dHid)\) | \(e_t\) | \((B, S, \dEmb)\) |
| \(b_h\) | \((\dHid)\) | \(h^\ell\) | \((B, S, \dHid)\) |
| \(b_z\) | \((\dEmb)\) | \(h_t^\ell\) | \((B, \dHid)\) |
| \(\LSTMZeroTwoLayer\) |  | \(\widehat{h^\ell}\) | \((B, \dHid)\) |
|  |  | \(x\) | \((B, S)\) |
|  |  | \(x_t\) | \((B)\) |
|  |  | \(y\) | \((B, S, V)\) |
|  |  | \(y_t\) | \((B, V)\) |
|  |  | \(z_t\) | \((B, \dEmb)\) |
|  |  | \(\widehat{z_t}\) | \((B, \dEmb)\) |

Parameters
  • d_blk (int, default: 1) – Number of units in a memory cell block \(\dBlk\).

  • d_emb (int, default: 1) – Token embedding dimension \(\dEmb\).

  • init_fb (float, default: 1.0) – Uniform distribution upper bound \(\init_{fb}\) used to initialize forget gate biases.

  • init_ib (float, default: -1.0) – Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.

  • init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

  • init_ob (float, default: -1.0) – Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.

  • init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

  • label_smoothing (float, default: 0.0) – Smoothing applied to the prediction target \(x_{t+1}\).

  • n_blk (int, default: 1) – Number of memory cell blocks \(\nBlk\).

  • n_lyr (int, default: 1) – Number of recurrent layers \(\nLyr\).

  • p_emb (float, default: 0.0) – Embeddings dropout probability \(\pEmb\).

  • p_hid (float, default: 0.0) – Hidden units dropout probability \(\pHid\).

  • tknzr (BaseTknzr) – Tokenizer instance.

d_blk#

Number of units in a memory cell block \(\dBlk\).

Type

int

d_hid#

Total number of memory cell units \(\dHid\).

Type

int

emb#

Token embedding lookup table \(E\). Input shape: \((B, S)\). Output shape: \((B, S, \dEmb)\).

Type

torch.nn.Embedding

fc_e2h#

Fully connected layer \(W_h\) and \(b_h\) which connects input units to the 1st recurrent layer’s input. Dropout with probability \(\pEmb\) is applied to input. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dEmb)\). Output shape: \((B, S, \dHid)\).

Type

torch.nn.Sequential

fc_h2e#

Fully connected layer \(W_z\) and \(b_z\) which transforms hidden states to next token embeddings. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dEmb)\).

Type

torch.nn.Sequential

init_fb#

Uniform distribution upper bound \(\init_{fb}\) used to initialize forget gate biases.

Type

float

init_ib#

Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.

Type

float

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type

float

init_ob#

Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.

Type

float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type

float

label_smoothing#

Smoothing applied on prediction target \(x_{t+1}\).

Type

float

model_name#

CLI name of LSTM (2002 version) is LSTM-2002.

Type

ClassVar[str]

n_blk#

Number of memory cell blocks \(\nBlk\).

Type

int

n_lyr#

Number of recurrent layers \(\nLyr\).

Type

int

p_emb#

Embeddings dropout probability \(\pEmb\).

Type

float

p_hid#

Hidden units dropout probability \(\pHid\).

Type

float

stack_rnn#

Stacked LSTM2002Layer layers. Each LSTM (2002 version) layer is followed by a dropout layer with probability \(\pHid\), so the list holds \(2 \nLyr\) modules in total. Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dHid)\). A hypothetical layout sketch is given below.

Type

torch.nn.ModuleList
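A layout sketch for such a module list, using only the constructor arguments documented for LSTM2002Layer. The alternation of recurrent layer and dropout modules is an assumption that matches the description above, not a reproduction of the library's internals.

import torch.nn as nn
from lmp.model import LSTM2002Layer

n_lyr, n_blk, d_blk, p_hid = 2, 8, 64, 0.1
d_hid = n_blk * d_blk
stack_rnn = nn.ModuleList()
for _ in range(n_lyr):
  # Every recurrent layer reads and writes (B, S, d_hid) features.
  stack_rnn.append(LSTM2002Layer(d_blk=d_blk, in_feat=d_hid, n_blk=n_blk))
  stack_rnn.append(nn.Dropout(p=p_hid))
assert len(stack_rnn) == 2 * n_lyr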

See also

LSTM2000

LSTM (2000 version) language model.

LSTM2000Layer

LSTM (2000 version) recurrent neural network.

LSTM2002Layer

LSTM (2002 version) recurrent neural network.

classmethod add_CLI_args(parser: ArgumentParser) None[source]#

Add LSTM (2002 version) language model hyperparameters to CLI argument parser.

Parameters

parser (argparse.ArgumentParser) – CLI argument parser.

Return type

None

See also

lmp.script.train_model

Language model training script.

Examples

>>> import argparse
>>> import math
>>> from lmp.model import LSTM2002
>>> parser = argparse.ArgumentParser()
>>> LSTM2002.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--d_blk', '64',
...   '--d_emb', '100',
...   '--init_fb', '0.1',
...   '--init_ib', '-0.1',
...   '--init_lower', '-0.01',
...   '--init_ob', '-0.1',
...   '--init_upper', '0.01',
...   '--label_smoothing', '0.1',
...   '--n_blk', '8',
...   '--n_lyr', '2',
...   '--p_emb', '0.5',
...   '--p_hid', '0.1',
... ])
>>> assert args.d_blk == 64
>>> assert args.d_emb == 100
>>> assert math.isclose(args.init_fb, 0.1)
>>> assert math.isclose(args.init_ib, -0.1)
>>> assert math.isclose(args.init_lower, -0.01)
>>> assert math.isclose(args.init_ob, -0.1)
>>> assert math.isclose(args.init_upper, 0.01)
>>> assert math.isclose(args.label_smoothing, 0.1)
>>> assert args.n_blk == 8
>>> assert args.n_lyr == 2
>>> assert math.isclose(args.p_emb, 0.5)
>>> assert math.isclose(args.p_hid, 0.1)
cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]#

Calculate language model prediction loss.

We use cross entropy loss as our training objective. This method is only used for training.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.

Returns

The first item in the tuple is the mini-batch cross-entropy loss. Loss tensor has shape \((1)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both structures are the same as batch_prev_states.

Return type

tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
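Examples

A training-step sketch using only the documented cal_loss interface. Here model is assumed to be an LSTM2002 instance whose vocabulary size is at least 10; the random batches and the SGD optimizer are illustrative assumptions.

>>> import torch
>>> optim = torch.optim.SGD(model.parameters(), lr=0.1)
>>> batch_cur_tkids = torch.randint(0, 10, (32, 20))    # (B, S) current token ids.
>>> batch_next_tkids = torch.randint(0, 10, (32, 20))   # (B, S) prediction targets.
>>> loss, (c_last, h_last) = model.cal_loss(
...   batch_cur_tkids=batch_cur_tkids,
...   batch_next_tkids=batch_next_tkids,
...   batch_prev_states=None,   # start from each layer's initial states.
... )
>>> optim.zero_grad()
>>> loss.backward()
>>> optim.step()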

forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]#

Calculate next token id logits.

Logits are calculated based on previous hidden states, memory cell previous internal states, and current input token ids. Use pred to convert logits into next token id probability distribution over tokenizer’s vocabulary. Use cal_loss to convert logits into next token id prediction loss. Below we describe the forward pass algorithm of the LSTM (2002 version) language model.

  1. Use token ids to lookup token embeddings with self.emb.

  2. Use self.fc_e2h to transform token embeddings into 1st recurrent layer’s input.

  3. Feed transformation result into recurrent layer and output hidden states. We use teacher forcing in this step when performing training, i.e., inputs are directly given instead of generated by the model.

  4. Feed the output of previous recurrent layer into next recurrent layer until all layers have been used once.

  5. Use self.fc_h2e to transform last recurrent layer’s hidden states to next token embeddings.

  6. Compute the inner product between the next token embeddings and each row of the token embedding table \(E\) to get similarity scores over the tokenizer’s vocabulary.

  7. Return similarity scores (logits).

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.

Returns

The first item in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both structures are the same as batch_prev_states.

Return type

tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
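Examples

A shape-checking sketch for forward; model is assumed to be an LSTM2002 instance, and the batch below is random.

>>> import torch
>>> B, S = 4, 16
>>> V = model.emb.num_embeddings                     # vocabulary size of the paired tokenizer.
>>> batch_cur_tkids = torch.randint(0, V, (B, S))
>>> logits, (c_last, h_last) = model(batch_cur_tkids, batch_prev_states=None)
>>> assert logits.shape == (B, S, V)
>>> assert len(c_last) == len(h_last) == model.n_lyr
>>> assert c_last[0].shape == (B, model.n_blk, model.d_blk)
>>> assert h_last[0].shape == (B, model.d_hid)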

params_init() None#

Initialize model parameters.

All weights and biases other than forget / input / output gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\). Forget gate biases are initialized with uniform distribution \(\mathcal{U}\pa{0, \init_{fb}}\). Input gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_{ib}, 0}\). Output gate biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_{ob}, 0}\).

Return type

None

See also

params_init

LSTM (2002 version) layer parameter initialization.

pred(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tuple[List[Tensor], List[Tensor]]] = None) Tuple[Tensor, Tuple[List[Tensor], List[Tensor]]]#

Calculate next token id probability distribution over tokenizer’s vocabulary.

Probabilities are calculated based on previous hidden states, memory cell previous internal states and current input token ids. This method is only used for inference. No computation graphs are constructed and no gradients are calculated.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[tuple[list[torch.Tensor], list[torch.Tensor]]], default: None) – The first tensor list in the tuple is a batch of memory cell previous internal states. There are \(\nLyr\) tensors in the list, each has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list in the tuple is a batch of previous hidden states. There are \(\nLyr\) tensors in the list, each has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states and memory cell initial internal states of each layer.

Returns

The first item in the tuple is the batch of next token id probability distributions over the tokenizer’s vocabulary. Probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a tuple containing two tensor lists. The first tensor list represents the memory cell last internal states of each recurrent layer derived from current input token ids. Each tensor in the first list has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. The second tensor list represents the last hidden states of each recurrent layer derived from current input token ids. Each tensor in the second list has shape \((B, \dHid)\) and dtype == torch.float. Both structures are the same as batch_prev_states.

Return type

tuple[torch.Tensor, tuple[list[torch.Tensor], list[torch.Tensor]]]
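Examples

A greedy-decoding sketch built on pred: feed one token per step and carry the returned states forward. Here model is assumed to be an LSTM2002 instance and bos_id is an assumed begin-of-sequence token id.

>>> import torch
>>> batch_prev_states = None
>>> cur = torch.full((1, 1), bos_id, dtype=torch.long)   # batch of one sequence of length one.
>>> gen = []
>>> for _ in range(32):
...   probs, batch_prev_states = model.pred(
...     batch_cur_tkids=cur,
...     batch_prev_states=batch_prev_states,
...   )
...   cur = probs[:, -1].argmax(dim=-1, keepdim=True)    # (1, 1) most probable next token id.
...   gen.append(int(cur))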

class lmp.model.LSTM2002Layer(*, d_blk: int = 1, in_feat: int = 1, init_fb: float = 1.0, init_ib: float = -1.0, init_lower: float = -0.1, init_ob: float = -1.0, init_upper: float = 0.1, n_blk: int = 1, **kwargs: Any)[source]#

Bases: LSTM2000Layer

LSTM (2002 version) [1] recurrent neural network.

  • Let \(\hIn\) be the number of input features per time step.

  • Let \(\dBlk\) be the number of units in a memory cell block.

  • Let \(\nBlk\) be the number of memory cell blocks.

  • Let \(\hOut = \nBlk \times \dBlk\) be the number of output features per time step.

  • Let \(x\) be a batch of sequences of input features with shape \((B, S, \hIn)\), where \(B\) is the batch size and \(S\) is the sequence length.

  • Let \(h_0\) be the initial hidden states with shape \((B, \hOut)\).

  • Let \(c_0\) be the memory cell initial internal states with shape \((B, \nBlk, \dBlk)\).

LSTM (2002 version) layer is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\LSTMZeroTwoLayer}\pa{x, c_0, h_0} \\ & \indent{1} S \algoEq x.\sz{1} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} \algoFor{k \in \set{1, \dots, \nBlk}} \\ & \indent{3} f_{t,k} \algoEq \sigma\pa{ W_{f,k} \cdot x_t + U_{f,k} \cdot h_{t-1} + V_{f,k} \cdot c_{t-1,k} + b_{f,k} } &&\tag{1}\label{1} \\ & \indent{3} i_{t,k} \algoEq \sigma\pa{ W_{i,k} \cdot x_t + U_{i,k} \cdot h_{t-1} + V_{i,k} \cdot c_{t-1,k} + b_{i,k} } &&\tag{2}\label{2} \\ & \indent{3} g_{t,k} \algoEq \tanh\pa{W_k \cdot x_t + U_k \cdot h_{t-1} + b_k} &&\tag{3}\label{3} \\ & \indent{3} c_{t,k} \algoEq f_{t, k} \cdot c_{t-1,k} + i_{t,k} \cdot g_{t,k} \\ & \indent{3} o_{t,k} \algoEq \sigma\pa{ W_{o,k} \cdot x_t + U_{o,k} \cdot h_{t-1} + V_{o,k} \cdot c_{t,k} + b_{o,k} } &&\tag{4}\label{4} \\ & \indent{3} h_{t,k} \algoEq o_{t,k} \cdot \tanh\pa{c_{t,k}} &&\tag{5}\label{5} \\ & \indent{2} \algoEndFor \\ & \indent{2} c_t \algoEq \cat{c_{t,1}, \dots, c_{t,\nBlk}} \\ & \indent{2} h_t \algoEq \fla{h_{t,1}, \dots, h_{t,\nBlk}} \\ & \indent{1} \algoEndFor \\ & \indent{1} c \algoEq \cat{c_1, \dots, c_S} \\ & \indent{1} h \algoEq \cat{h_1, \dots, h_S} \\ & \indent{1} \algoReturn (c, h) \\ & \algoEndProc \end{align*}\end{split}\]
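The per-block loop above can be vectorized over all \(\nBlk\) memory cell blocks. The following single-time-step sketch illustrates that vectorization; the parameter dictionary p stacks the per-block weights (e.g. W_f has shape (n_blk, h_in), U_f has shape (n_blk, n_blk * d_blk), V_f has shape (n_blk, d_blk)) and is an assumption for exposition, not the library's internal layout.

import torch

def lstm2002_step(x_t, h_prev, c_prev, p):
  # x_t: (B, h_in), h_prev: (B, n_blk * d_blk), c_prev: (B, n_blk, d_blk).
  f = torch.sigmoid(x_t @ p['W_f'].T + h_prev @ p['U_f'].T
                    + (p['V_f'] * c_prev).sum(dim=-1) + p['b_f'])    # (B, n_blk), eq. (1)
  i = torch.sigmoid(x_t @ p['W_i'].T + h_prev @ p['U_i'].T
                    + (p['V_i'] * c_prev).sum(dim=-1) + p['b_i'])    # (B, n_blk), eq. (2)
  g = torch.tanh(x_t @ p['W_g'].T + h_prev @ p['U_g'].T + p['b_g'])  # (B, n_blk * d_blk), eq. (3)
  g = g.view(c_prev.shape)                                           # (B, n_blk, d_blk)
  c = f.unsqueeze(-1) * c_prev + i.unsqueeze(-1) * g                 # (B, n_blk, d_blk)
  o = torch.sigmoid(x_t @ p['W_o'].T + h_prev @ p['U_o'].T
                    + (p['V_o'] * c).sum(dim=-1) + p['b_o'])         # (B, n_blk), eq. (4): uses new c
  h = (o.unsqueeze(-1) * torch.tanh(c)).flatten(start_dim=1)         # (B, n_blk * d_blk), eq. (5)
  return c, h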

Trainable Parameters and Nodes

The left two columns list trainable parameters; the right two list the tensors (nodes) appearing in the procedure above.

| Parameter | Shape | Symbol | Shape |
| --- | --- | --- | --- |
| \(U_{f,k}\) | \((1, \hOut)\) | \(c\) | \((B, S, \nBlk, \dBlk)\) |
| \(U_{i,k}\) | \((1, \hOut)\) | \(c_t\) | \((B, \nBlk, \dBlk)\) |
| \(U_k\) | \((\dBlk, \hOut)\) | \(c_{t,k}\) | \((B, \dBlk)\) |
| \(U_{o,k}\) | \((1, \hOut)\) | \(f_{t,k}\) | \((B, 1)\) |
| \(V_{f,k}\) | \((1, \dBlk)\) | \(g_{t,k}\) | \((B, \dBlk)\) |
| \(V_{i,k}\) | \((1, \dBlk)\) | \(h\) | \((B, S, \hOut)\) |
| \(V_{o,k}\) | \((1, \dBlk)\) | \(h_t\) | \((B, \hOut)\) |
| \(W_{f,k}\) | \((1, \hIn)\) | \(h_{t,k}\) | \((B, \dBlk)\) |
| \(W_{i,k}\) | \((1, \hIn)\) | \(i_{t,k}\) | \((B, 1)\) |
| \(W_k\) | \((\dBlk, \hIn)\) | \(o_{t,k}\) | \((B, 1)\) |
| \(W_{o,k}\) | \((1, \hIn)\) | \(x\) | \((B, S, \hIn)\) |
| \(b_{f,k}\) | \((1)\) | \(x_t\) | \((B, \hIn)\) |
| \(b_{i,k}\) | \((1)\) |  |  |
| \(b_k\) | \((\dBlk)\) |  |  |
| \(b_{o,k}\) | \((1)\) |  |  |

  • The differences between LSTM2000Layer and LSTM2002Layer are listed below:

    • Input, forget and output gate units have peephole connections that connect directly to the memory cell internal states. See \(\eqref{1}\eqref{2}\eqref{4}\).

    • Output gate units can be calculated only after updating memory cell internal states. See \(\eqref{4}\).

  • The implementation in the paper uses identity mappings in \(\eqref{3}\eqref{5}\); our implementation uses \(\tanh\) instead. We argue that the change in \(\eqref{3}\eqref{5}\) keeps model activations bounded, whereas the paper’s formulation is unbounded. Since language models are usually trained with much larger dimensions than in the paper (which uses dimension \(1\) for everything), LSTM activations tend to grow to extremely positive / negative values without \(\tanh\), as the toy comparison below illustrates.
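A toy comparison of the two choices (an illustration added here, not taken from the paper): with identity mappings the hidden state grows without bound, while the \(\tanh\) variant stays in \((-1, 1)\).

import torch

c_id = torch.zeros(1)      # internal state with identity mappings in eq. (3) / (5).
c_tanh = torch.zeros(1)    # internal state with the tanh variant.
g = torch.full((1,), 2.0)  # constant pre-activation input; gates assumed fully open (= 1).
for _ in range(50):
  c_id = c_id + g                  # identity in eq. (3): internal state grows linearly.
  c_tanh = c_tanh + torch.tanh(g)
h_id = c_id                        # identity in eq. (5): hidden state is unbounded (here 100.0).
h_tanh = torch.tanh(c_tanh)        # tanh in eq. (5): hidden state stays in (-1, 1) (here ~1.0).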

Parameters
  • d_blk (int, default: 1) – Dimension of each memory cell block \(\dBlk\).

  • in_feat (int, default: 1) – Number of input features per time step \(\hIn\).

  • init_fb (float, default: 1.0) – Uniform distribution upper bound \(\init_{fb}\) used to initialize forget gate biases.

  • init_ib (float, default: -1.0) – Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.

  • init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

  • init_ob (float, default: -1.0) – Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.

  • init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

  • n_blk (int, default: 1) – Number of memory cell blocks \(\nBlk\).

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

c_0#

Memory cell blocks’ initial internal states \(c_0\). Shape: \((1, \nBlk, \dBlk)\).

Type

torch.Tensor

d_blk#

Number of units in a memory cell block \(\dBlk\).

Type

int

d_hid#

Total number of memory cell units \(\hOut\).

Type

int

fc_h2fg#

Fully connected layer \(\pa{U_{f,1}, \dots, U_{f,\nBlk}}\) which connects hidden states to memory cell’s forget gate units. Input shape: \((B, \dHid)\). Output shape: \((B, \nBlk)\).

Type

torch.nn.Linear

fc_h2ig#

Fully connected layer \(\pa{U_{i,1}, \dots, U_{i,\nBlk}}\) which connects hidden states to memory cell’s input gate units. Input shape: \((B, \dHid)\). Output shape: \((B, \nBlk)\).

Type

torch.nn.Linear

fc_h2mc_in#

Fully connected layers \(\pa{U_1, \dots, U_{\nBlk}}\) which connect hidden states to memory cell blocks’ input activations. Input shape: \((B, \dHid)\). Output shape: \((B, \dHid)\).

Type

torch.nn.Linear

fc_h2og#

Fully connected layer \(\pa{U_{o,1}, \dots, U_{o,\nBlk}}\) which connects hidden states to memory cell’s output gate units. Input shape: \((B, \dHid)\). Output shape: \((B, \nBlk)\).

Type

torch.nn.Linear

fc_x2fg#

Fully connected layer \(\pa{W_{f,1}, \dots, W_{f,\nBlk}}\) and \(\pa{b_{f,1}, \dots, b_{f,\nBlk}}\) which connects input units to memory cell’s forget gate units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \nBlk)\).

Type

torch.nn.Linear

fc_x2ig#

Fully connected layer \(\pa{W_{i,1}, \dots, W_{i,\nBlk}}\) and \(\pa{b_{i,1}, \dots, b_{i,\nBlk}}\) which connects input units to memory cell’s input gate units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \nBlk)\).

Type

torch.nn.Linear

fc_x2mc_in#

Fully connected layers \(\pa{W_1, \dots, W_{\nBlk}}\) and \(\pa{b_1, \dots, b_{\nBlk}}\) which connect input units to memory cell blocks’ input activations. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \dHid)\).

Type

torch.nn.Linear

fc_x2og#

Fully connected layer \(\pa{W_{o,1}, \dots, W_{o,\nBlk}}\) and \(\pa{b_{o,1}, \dots, b_{o,\nBlk}}\) which connects input units to memory cell’s output gate units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \nBlk)\).

Type

torch.nn.Linear

h_0#

Initial hidden states \(h_0\). Shape: \((1, \dHid)\).

Type

torch.Tensor

in_feat#

Number of input features per time step \(\hIn\).

Type

int

init_fb#

Uniform distribution upper bound \(\init_{fb}\) used to initialize forget gate biases.

Type

float

init_ib#

Uniform distribution lower bound \(\init_{ib}\) used to initialize input gate biases.

Type

float

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type

float

init_ob#

Uniform distribution lower bound \(\init_{ob}\) used to initialize output gate biases.

Type

float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type

float

n_blk#

Number of memory cell blocks \(\nBlk\).

Type

int

pc_c2fg#

Peephole connections \(\pa{V_{f,1}, \dots, V_{f,\nBlk}}\) which connect the \(k\)-th memory cell blocks’ internal states to the \(k\)-th forget gate units. Shape: \((1, \nBlk, \dBlk)\).

Type

torch.nn.Parameter

pc_c2ig#

Peephole connections \(\pa{V_{i,1}, \dots, V_{i,\nBlk}}\) which connect the \(k\)-th memory cell blocks’ internal states to the \(k\)-th input gate units. Shape: \((1, \nBlk, \dBlk)\).

Type

torch.nn.Parameter

pc_c2og#

Peephole connections \(\pa{V_{o,1}, \dots, V_{o,\nBlk}}\) which connect the \(k\)-th memory cell blocks’ internal states to the \(k\)-th output gate units. Shape: \((1, \nBlk, \dBlk)\).

Type

torch.nn.Parameter

See also

LSTM2000Layer

LSTM (2000 version) recurrent neural network.

forward(x: Tensor, c_0: Optional[Tensor] = None, h_0: Optional[Tensor] = None) Tuple[Tensor, Tensor][source]#

Calculate batch of hidden states for x.

Below we describe the forward pass algorithm of LSTM (2002 version) layer.

  1. Let x be a batch of sequences of input features \(x\).

  2. Let x.size(1) be sequence length \(S\).

  3. Let c_0 be the memory cell initial internal states \(c_0\). If c_0 is None, use self.c_0 instead.

  4. Let h_0 be the initial hidden states \(h_0\). If h_0 is None, use self.h_0 instead.

  5. Loop through \(\set{1, \dots, S}\) with looping index \(t\).

    1. Use \(x_t\), \(h_{t-1}\), \(c_{t-1}\), self.fc_x2fg, self.fc_h2fg and self.pc_c2fg to get forget gate units \(f_{t,1}, \dots, f_{t,\nBlk}\).

    2. Use \(x_t\), \(h_{t-1}\), \(c_{t-1}\), self.fc_x2ig, self.fc_h2ig and self.pc_c2ig to get input gate units \(i_{t,1}, \dots, i_{t,\nBlk}\).

    3. Use \(x_t\), \(h_{t-1}\), self.fc_x2mc_in and self.fc_h2mc_in to get memory cell input activations \(g_{t,1}, \dots, g_{t,\nBlk}\).

    4. Derive new internal state \(c_{t,1}, \dots, c_{t,\nBlk}\) using forget gates units \(f_{t,1}, \dots, f_{t,\nBlk}\), input gate units \(i_{t,1}, \dots, i_{t,\nBlk}\) and memory cell input activations \(g_{t,1}, \dots, g_{t,\nBlk}\).

    5. Use \(x_t\), \(h_{t-1}\), \(c_{t,1}, \dots, c_{t,\nBlk}\), self.fc_x2og, self.fc_h2og and self.pc_c2og to get output gate units \(o_{t,1}, \dots, o_{t,\nBlk}\).

    6. Derive new hidden states \(h_t\) using output gate units \(o_{t,1}, \dots, o_{t,\nBlk}\) and memory cell new internal states \(c_{t,1}, \dots, c_{t,\nBlk}\).

  6. Denote the concatenation of memory cell internal states \(c_1, \dots, c_S\) as \(c\).

  7. Denote the concatenation of hidden states \(h_1, \dots, h_S\) as \(h\).

  8. Return \((c, h)\).

Parameters
  • x (torch.Tensor) – Batch of sequences of input features. x has shape \((B, S, \hIn)\) and dtype == torch.float.

  • c_0 (Optional[torch.Tensor], default: None) – Batch of memory cell previous internal states. The tensor has shape \((B, \nBlk, \dBlk)\) and dtype == torch.float. Set to None to use the memory cell initial internal states self.c_0.

  • h_0 (Optional[torch.Tensor], default: None) – Batch of previous hidden states. The tensor has shape \((B, \hOut)\) and dtype == torch.float. Set to None to use the initial hidden states self.h_0.

Returns

The first tensor is the batch of memory cell internal states and the second tensor is the batch of hidden states. The batch of memory cell internal states has shape \((B, S, \nBlk, \dBlk)\) and dtype == torch.float. The batch of hidden states has shape \((B, S, \hOut)\) and dtype == torch.float.

Return type

tuple[torch.Tensor, torch.Tensor]
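Examples

A shape-checking sketch for LSTM2002Layer.forward; the sizes below are arbitrary and chosen only for illustration.

>>> import torch
>>> from lmp.model import LSTM2002Layer
>>> B, S, in_feat, n_blk, d_blk = 4, 16, 100, 8, 64
>>> layer = LSTM2002Layer(d_blk=d_blk, in_feat=in_feat, n_blk=n_blk)
>>> x = torch.rand(B, S, in_feat)
>>> c, h = layer(x)                          # default to self.c_0 and self.h_0.
>>> assert c.shape == (B, S, n_blk, d_blk)   # memory cell internal states.
>>> assert h.shape == (B, S, n_blk * d_blk)  # hidden states.
>>> # Carry the last states into the next chunk of a long sequence.
>>> c2, h2 = layer(x, c_0=c[:, -1], h_0=h[:, -1])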

params_init() None[source]#

Initialize model parameters.

All weights and biases other than \(b_f, b_i, b_o\) are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\). \(b_f\) is initialized with uniform distribution \(\mathcal{U}\pa{0, \init_{fb}}\). \(b_i\) is initialized with uniform distribution \(\mathcal{U}\pa{\init_{ib}, 0}\). \(b_o\) is initialized with uniform distribution \(\mathcal{U}\pa{\init_{ob}, 0}\). \(b_f\) is initialized separately so that forget gates remain open at the start of training. \(b_i, b_o\) are initialized separately so that input and output gates remain closed at the start of training.

Return type

None

See also

params_init

LSTM (2000 version) layer parameter initialization.
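The bias ranges described above can be written as the following sketch. Only the uniform ranges come from the documentation; applying a generic initialization first and then overwriting the gate biases through the documented fc_x2fg / fc_x2ig / fc_x2og sub-modules is an assumption about the internals, not the library's actual code.

import torch.nn as nn

def params_init_sketch(layer, init_lower, init_upper, init_fb, init_ib, init_ob):
  # Generic weights and biases: U(init_lower, init_upper).
  for p in layer.parameters():
    nn.init.uniform_(p, init_lower, init_upper)
  # Gate biases get dedicated ranges so gates start open (forget) or closed (input, output).
  nn.init.uniform_(layer.fc_x2fg.bias, 0.0, init_fb)   # b_f ~ U(0, init_fb).
  nn.init.uniform_(layer.fc_x2ig.bias, init_ib, 0.0)   # b_i ~ U(init_ib, 0).
  nn.init.uniform_(layer.fc_x2og.bias, init_ob, 0.0)   # b_o ~ U(init_ob, 0).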

[1]

Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug):115–143, 2002. URL: https://www.jmlr.org/papers/v3/gers02a.html