Elman-Net#

class lmp.model.ElmanNet(*, d_emb: int = 1, d_hid: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, label_smoothing: float = 0.0, n_lyr: int = 1, p_emb: float = 0.0, p_hid: float = 0.0, tknzr: BaseTknzr, **kwargs: Any)[source]#

Bases: BaseModel

Elman Net [1] language model.

Let \(x\) be a batch of token ids with batch size \(B\) and per-sequence length \(S\).
Let \(V\) be the vocabulary size of the paired tokenizer. Each token id represents a unique token, i.e., \(x_t \in \set{1, \dots, V}\).
Let \(E\) be the token embedding lookup table.
- Let \(\dEmb\) be the dimension of token embeddings.
- Let \(e_t\) be the token embedding corresponding to token id \(x_t\).
- Token embeddings have dropout probability \(\pEmb\).
Let \(\nLyr\) be the number of recurrent layers.
Let \(\dHid\) be the number of recurrent units in each recurrent layer.
Let \(h^\ell\) be the hidden states of the \(\ell\) th recurrent layer.
- Let \(h_t^\ell\) be the \(t\) th time step of \(h^\ell\).
- The initial hidden states \(h_0^\ell\) are given as input.
- Hidden states have dropout probability \(\pHid\).

Elman Net language model is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\ElmanNet}\pa{x, \br{h_0^1, \dots, h_0^{\nLyr}}} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} e_t \algoEq (x_t)\text{-th row of } E \text{ but treated as column vector} \\ & \indent{2} \widehat{e_t} \algoEq \drop{e_t}{\pEmb} \\ & \indent{2} h_t^0 \algoEq \tanh\pa{W_h \cdot \widehat{e_t} + b_h} \\ & \indent{1} \algoEndFor \\ & \indent{1} h^0 \algoEq \cat{h_1^0, \dots, h_S^0} \\ & \indent{1} \widehat{h^0} \algoEq \drop{h^0}{\pHid} \\ & \indent{1} \algoFor{\ell \in \set{1, \dots, \nLyr}} \\ & \indent{2} h^\ell \algoEq \ElmanNetLayer\pa{x \algoEq \widehat{h^{\ell-1}}, h_0 \algoEq h_0^\ell} \\ & \indent{2} \widehat{h^\ell} \algoEq \drop{h^\ell}{\pHid} \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} z_t \algoEq \tanh\pa{W_z \cdot h_t^{\nLyr} + b_z} \\ & \indent{2} \widehat{z_t} \algoEq \drop{z_t}{\pHid} \\ & \indent{2} y_t \algoEq \sof{E \cdot \widehat{z_t}} \\ & \indent{1} \algoEndFor \\ & \indent{1} y \algoEq \cat{y_1, \dots, y_S} \\ & \indent{1} \algoReturn \pa{y, \br{h_S^1, \dots, h_S^{\nLyr}}} \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters		Nodes
Parameter	Shape	Symbol	Shape
\(E\)	\((V, \dEmb)\)	\(e_t\)	\((B, S, \dEmb)\)
\(W_h\)	\((\dHid, \dEmb)\)	\(\widehat{e_t}\)	\((B, S, \dEmb)\)
\(W_z\)	\((\dEmb, \dHid)\)	\(h^\ell\)	\((B, S, \dHid)\)
\(b_h\)	\((\dHid)\)	\(h_t^\ell\)	\((B, \dHid)\)
\(b_z\)	\((\dEmb)\)	\(\widehat{h^\ell}\)	\((B, S, \dHid)\)
\(\ElmanNetLayer\)		\(x\)	\((B, S)\)
		\(x_t\)	\((B)\)
		\(y\)	\((B, S, V)\)
		\(y_t\)	\((B, V)\)
		\(z_t\)	\((B, \dEmb)\)
		\(\widehat{z_t}\)	\((B, \dEmb)\)
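
As a rough worked example of the shapes in the table above, the sketch below counts trainable parameters for a given set of hyperparameters. It counts only the parameters listed in the tables on this page (\(E\), \(W_h\), \(b_h\), \(W_z\), \(b_z\), and \(U\), \(W\), \(b\) for each recurrent layer). The helper name `count_elman_net_params` is hypothetical, and anything else the actual implementation may train (for instance the layers' initial hidden states) is not included.

```python
def count_elman_net_params(vocab_size: int, d_emb: int, d_hid: int, n_lyr: int) -> int:
    """Count the parameters listed in the shape tables (hypothetical helper)."""
    emb = vocab_size * d_emb            # E: (V, d_emb)
    fc_e2h = d_hid * d_emb + d_hid      # W_h: (d_hid, d_emb), b_h: (d_hid)
    fc_h2e = d_emb * d_hid + d_emb      # W_z: (d_emb, d_hid), b_z: (d_emb)
    # Inside ElmanNet every recurrent layer maps d_hid -> d_hid, so
    # W: (d_hid, d_hid), U: (d_hid, d_hid), b: (d_hid).
    per_layer = d_hid * d_hid + d_hid * d_hid + d_hid
    return emb + fc_e2h + fc_h2e + n_lyr * per_layer


n_params = count_elman_net_params(vocab_size=10, d_emb=2, d_hid=4, n_lyr=2)
print(n_params)  # 20 + 12 + 10 + 2 * 36 = 114
```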

The goal of optimization is to minimize the negative log-likelihood of the next token id \(x_{t+1}\) given \(y_t\). The prediction loss is defined as the average negative log-likelihood over \(x\) given \(y\).

\[\loss = \dfrac{-1}{S} \sum_{t = 1}^S \log \Pr(x_{t+1} \vert y_t).\]
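
The formula can be checked on a toy example. The sketch below is plain Python rather than the library's code. The helper name is hypothetical, and token ids are 0-indexed here for list indexing, whereas the text uses \(\set{1, \dots, V}\).

```python
import math


def avg_neg_log_likelihood(probs, next_tkids):
    """Average negative log-likelihood for one sequence.

    probs[t][v] plays the role of Pr(v | y_t) and next_tkids[t] the role of
    x_{t+1}. Token ids are 0-indexed in this sketch.
    """
    S = len(next_tkids)
    return -sum(math.log(probs[t][next_tkids[t]]) for t in range(S)) / S


# Two time steps over a vocabulary of three tokens.
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
next_tkids = [0, 1]
loss = avg_neg_log_likelihood(probs, next_tkids)
```

Here the loss equals \(-(\log 0.5 + \log 0.8) / 2\), the mean of the per-step negative log-probabilities assigned to the correct next tokens.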

\(z_t\) is obtained by transforming \(h_t^{\nLyr}\) from dimension \(\dHid\) to \(\dEmb\). This is only needed for shape consistency: the hidden states \(h_t^{\nLyr}\) have shape \((B, \dHid)\) and \(E\) has shape \((V, \dEmb)\).
\(y_t\) is the next token id prediction probability distribution over the tokenizer’s vocabulary. We use inner products with the rows of \(E\) to calculate similarity scores over all token ids, and then use softmax to normalize the similarity scores into a probability distribution, so each entry lies in \([0, 1]\).
Our implementation uses \(\tanh\) as the activation function instead of the sigmoid function used in the paper. The reason is simply to allow embeddings to have negative values.
Model parameters in Elman Net are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) of uniform distribution are given as hyperparameters.
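
The reuse of \(E\) as the output projection described above can be sketched for a single time step as follows. This is a minimal plain-Python illustration with made-up numbers and hypothetical helper names, not the library's implementation.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(E, z_t):
    """Score z_t against every row of E by inner product, then apply softmax."""
    scores = [sum(e_i * z_i for e_i, z_i in zip(row, z_t)) for row in E]
    return softmax(scores)


E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy lookup table: V = 3, d_emb = 2
z_t = [0.5, -0.5]                          # toy output of the W_z transformation
y_t = next_token_distribution(E, z_t)
```

The token whose embedding row is most similar to \(z_t\) (the first row in this toy case) receives the highest probability.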

Parameters

d_emb (int, default: 1) – Token embedding dimension \(\dEmb\).
d_hid (int, default: 1) – Hidden states dimension \(\dHid\).
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
label_smoothing (float, default: 0.0) – Smoothing applied on prediction target \(x_{t+1}\).
n_lyr (int, default: 1) – Number of recurrent layers \(\nLyr\).
p_emb (float, default: 0.0) – Embeddings dropout probability \(\pEmb\).
p_hid (float, default: 0.0) – Hidden units dropout probability \(\pHid\).
tknzr (BaseTknzr) – Tokenizer instance.

d_emb#

Token embedding dimension \(\dEmb\).

Type: int

d_hid#

Hidden states dimension \(\dHid\).

Type: int

emb#

Token embedding lookup table \(E\). Input shape: \((B, S)\). Output shape: \((B, S, \dEmb)\).

Type: torch.nn.Embedding

fc_e2h#

Fully connected layer \(W_h\) and \(b_h\) which connects input units to the 1st recurrent layer’s input. Dropout with probability \(\pEmb\) is applied to input. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dEmb)\). Output shape: \((B, S, \dHid)\).

Type: torch.nn.Sequential

fc_h2e#

Fully connected layer \(W_z\) and \(b_z\) which transforms hidden states to next token embeddings. Dropout with probability \(\pHid\) is applied to output. Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dEmb)\).

Type: torch.nn.Sequential

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type: float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type: float

label_smoothing#

Smoothing applied on prediction target \(x_{t+1}\).

Type: float

loss_fn#

Loss function to be optimized.

Type: torch.nn.CrossEntropyLoss
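
To illustrate how label_smoothing changes the target, the sketch below assumes the convention documented for torch.nn.CrossEntropyLoss, where the one-hot target is mixed with a uniform distribution over the vocabulary. It is a plain-Python illustration for a single prediction. The helper name is hypothetical and token ids are 0-indexed.

```python
import math


def smoothed_cross_entropy(probs, target, label_smoothing):
    """Cross entropy against a smoothed one-hot target (single prediction)."""
    V = len(probs)
    # Every class gets label_smoothing / V; the target class additionally
    # gets the remaining 1 - label_smoothing.
    q = [label_smoothing / V] * V
    q[target] += 1.0 - label_smoothing
    return -sum(q_v * math.log(p_v) for q_v, p_v in zip(q, probs))


probs = [0.7, 0.2, 0.1]
plain = smoothed_cross_entropy(probs, target=0, label_smoothing=0.0)
smooth = smoothed_cross_entropy(probs, target=0, label_smoothing=0.1)
```

With label_smoothing set to 0.0 this reduces to the ordinary negative log-likelihood of the target token.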

model_name#

CLI name of Elman Net is Elman-Net.

Type: ClassVar[str]

n_lyr#

Number of recurrent layers \(\nLyr\).

Type: int

p_emb#

Embeddings dropout probability \(\pEmb\).

Type: float

p_hid#

Hidden units dropout probability \(\pHid\).

Type: float

stack_rnn#

Stack of ElmanNetLayer modules. Each Elman Net layer is followed by a dropout layer with probability \(\pHid\), so the list contains \(2 \nLyr\) modules in total. Input shape: \((B, S, \dHid)\). Output shape: \((B, S, \dHid)\).

Type: torch.nn.ModuleList

See also

ElmanNetLayer: Elman Net recurrent neural network.

classmethod add_CLI_args(parser: ArgumentParser) → None[source]#

Add Elman Net language model hyperparameters to CLI argument parser.

Parameters: parser (argparse.ArgumentParser) – CLI argument parser.
Return type: None

See also

lmp.script.train_model: Language model training script.

Examples

>>> import argparse
>>> import math
>>> from lmp.model import ElmanNet
>>> parser = argparse.ArgumentParser()
>>> ElmanNet.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--d_emb', '2',
...   '--d_hid', '4',
...   '--init_lower', '-0.01',
...   '--init_upper', '0.01',
...   '--label_smoothing', '0.1',
...   '--n_lyr', '2',
...   '--p_emb', '0.5',
...   '--p_hid', '0.1',
... ])
>>> assert args.d_emb == 2
>>> assert args.d_hid == 4
>>> assert math.isclose(args.init_lower, -0.01)
>>> assert math.isclose(args.init_upper, 0.01)
>>> assert math.isclose(args.label_smoothing, 0.1)
>>> assert args.n_lyr == 2
>>> assert math.isclose(args.p_emb, 0.5)
>>> assert math.isclose(args.p_hid, 0.1)

cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[List[Tensor]] = None) → Tuple[Tensor, List[Tensor]][source]#

Calculate language model prediction loss.

We use cross entropy loss as our training objective. This method is only used for training.

Parameters

batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[list[torch.Tensor]], default: None) – Batch of previous hidden states. There are \(\nLyr\) tensors in the list. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states of each layer.

Returns

The first item in the tuple is the mini-batch cross-entropy loss. Loss tensor has shape \((1)\) and dtype == torch.float. The second item in the tuple is a list of tensors. Each tensor in the list holds the last hidden states of one recurrent layer derived from the current input token ids. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float.

Return type

tuple[torch.Tensor, list[torch.Tensor]]
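
Both token id arguments share the shape \((B, S)\). One common way to obtain them, assumed here rather than stated on this page, is to shift a batch of \(S + 1\) token ids by one position. The helper below is hypothetical and uses nested lists instead of tensors to stay dependency-free.

```python
def split_cur_next(batch_tkids):
    """Split (B, S + 1) token ids into inputs and targets of shape (B, S)."""
    batch_cur_tkids = [seq[:-1] for seq in batch_tkids]   # x_1, ..., x_S
    batch_next_tkids = [seq[1:] for seq in batch_tkids]   # x_2, ..., x_{S+1}
    return batch_cur_tkids, batch_next_tkids


# Made-up token ids for two sequences of length 4.
batch_tkids = [[2, 7, 8, 3], [2, 5, 3, 0]]
cur, nxt = split_cur_next(batch_tkids)
```

With tensors the same split would be the slices `batch_tkids[:, :-1]` and `batch_tkids[:, 1:]`, which can then be passed as batch_cur_tkids and batch_next_tkids.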

forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[List[Tensor]] = None) → Tuple[Tensor, List[Tensor]][source]#

Calculate next token id logits.

Logits are calculated based on previous hidden states and current input token ids. Use pred to convert logits into a next token id probability distribution over the tokenizer’s vocabulary. Use cal_loss to convert logits into the next token id prediction loss. Below we describe the forward pass algorithm of the Elman Net language model.

Use token ids to lookup token embeddings with self.emb.
Use self.fc_e2h to transform token embeddings into 1st recurrent layer’s input.
Feed the transformation result into the recurrent layer and output hidden states. We use teacher forcing in this step when training, i.e., inputs are given directly instead of being generated by the model.
Feed the output of previous recurrent layer into next recurrent layer until all layers have been used once.
Use self.fc_h2e to transform last recurrent layer’s hidden states to next token embeddings.
Perform inner product on token embeddings over tokenizer’s vocabulary to get similarity scores.
Return similarity scores (logits).
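
The steps above can be sketched in plain Python for a single sequence. This is a minimal illustration, not the library's code. Dropout is left out (as if \(\pEmb = \pHid = 0\)), token ids are 0-indexed, the parameter values are made up, and all function names are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def affine_tanh(W, b, v):
    """Compute tanh(W v + b)."""
    return [math.tanh(a + c) for a, c in zip(matvec(W, v), b)]


def elman_layer(xs, h_0, W, U, b):
    """Run the recurrence h_t = tanh(W x_t + U h_{t-1} + b) over xs."""
    hs, h_prev = [], h_0
    for x_t in xs:
        pre = [p + q + r for p, q, r in zip(matvec(W, x_t), matvec(U, h_prev), b)]
        h_prev = [math.tanh(a) for a in pre]
        hs.append(h_prev)
    return hs


def elman_net_logits(tkids, E, W_h, b_h, layers, h_0s, W_z, b_z):
    """Return (logits, last hidden state of each layer) for one sequence."""
    # Embedding lookup, then transform into the first recurrent layer's input.
    hs = [affine_tanh(W_h, b_h, E[t]) for t in tkids]
    # Feed each layer's output into the next layer, keeping every last state.
    last_states = []
    for (W, U, b), h_0 in zip(layers, h_0s):
        hs = elman_layer(hs, h_0, W, U, b)
        last_states.append(hs[-1])
    # Project back to embedding space, then score against every row of E.
    zs = [affine_tanh(W_z, b_z, h) for h in hs]
    logits = [matvec(E, z) for z in zs]
    return logits, last_states


small = [[0.1, 0.0], [0.0, 0.1]]
E = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]  # V = 3, d_emb = 2
logits, last_states = elman_net_logits(
    tkids=[0, 2, 1], E=E, W_h=small, b_h=[0.0, 0.0],
    layers=[(small, small, [0.0, 0.0])], h_0s=[[0.0, 0.0]],
    W_z=small, b_z=[0.0, 0.0],
)
```

The result has one row of \(V\) logits per time step and one last hidden state per recurrent layer, matching the shapes \((S, V)\) and \((\dHid)\) for a single sequence.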

Parameters

batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[list[torch.Tensor]], default: None) – Batch of previous hidden states. There are \(\nLyr\) tensors in the list. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states of each layer.

Returns

The first item in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a list of tensors. Each tensor in the list holds the last hidden states of one recurrent layer derived from the current input token ids. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float.

Return type

tuple[torch.Tensor, list[torch.Tensor]]

See also

enc: Source of token ids.

params_init() → None[source]#

Initialize model parameters.

All weights and biases are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\).

Return type: None

See also

ElmanNetLayer.params_init: Elman Net layer parameter initialization.

pred(batch_cur_tkids: Tensor, batch_prev_states: Optional[List[Tensor]] = None) → Tuple[Tensor, List[Tensor]][source]#

Calculate next token id probability distribution over tokenizer’s vocabulary.

Probabilities are calculated based on previous hidden states and current input token ids. This method is only used for inference. No tensor graphs are constructed and no gradients are calculated.

Parameters

batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[list[torch.Tensor]], default: None) – Batch of previous hidden states. There are \(\nLyr\) tensors in the list. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float. Set to None to use the initial hidden states of each layer.

Returns

The first item in the tuple is the batch of next token id probability distributions over the paired tokenizer’s vocabulary. Probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple is a list of tensors. Each tensor in the list holds the last hidden states of one recurrent layer derived from the current input token ids. Each tensor in the list has shape \((B, \dHid)\) and dtype == torch.float.

Return type

tuple[torch.Tensor, list[torch.Tensor]]

class lmp.model.ElmanNetLayer(*, in_feat: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, out_feat: int = 1, **kwargs: Any)[source]#

Bases: Module

Elman Net [1] recurrent neural network.

Let \(\hIn\) be the number of input features per time step.
Let \(\hOut\) be the number of output features per time step.
Let \(x\) be a batch of sequence of input features with shape \((B, S, \hIn)\), where \(B\) is batch size and \(S\) is per sequence length.
Let \(h_0\) be the initial hidden states with shape \((B, \hOut)\).

Elman Net layer is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\ElmanNetLayer}\pa{x, h_0} \\ & \indent{1} S \algoEq x.\sz{1} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} h_t \algoEq \tanh\pa{W \cdot x_t + U \cdot h_{t-1} + b} \\ & \indent{1} \algoEndFor \\ & \indent{1} h \algoEq \cat{h_1, \dots, h_S} \\ & \indent{1} \algoReturn h \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters		Nodes
Parameter	Shape	Symbol	Shape
\(U\)	\((\hOut, \hOut)\)	\(h\)	\((B, S, \hOut)\)
\(W\)	\((\hOut, \hIn)\)	\(h_t\)	\((B, \hOut)\)
\(b\)	\((\hOut)\)	\(x\)	\((B, S, \hIn)\)
		\(x_t\)	\((B, \hIn)\)

Our implementation uses \(\tanh\) as the activation function instead of the sigmoid function used in the paper. The reason is simply to allow embeddings to have negative values.
Model parameters in Elman Net layer are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.

Parameters

in_feat (int, default: 1) – Number of input features per time step \(\hIn\).
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
out_feat (int, default: 1) – Number of output features per time step \(\hOut\).

fc_x2h#

Fully connected layer with parameters \(W\) and \(b\) which connects input units to recurrent units. Input shape: \((B, S, \hIn)\). Output shape: \((B, S, \hOut)\).

Type: torch.nn.Linear

fc_h2h#

Fully connected layer \(U\) which connects recurrent units to recurrent units. Input shape: \((B, \hOut)\). Output shape: \((B, \hOut)\).

Type: torch.nn.Linear

h_0#

Initial hidden states \(h_0\). Shape: \((1, \hOut)\).

Type: torch.Tensor

in_feat#

Number of input features per time step \(\hIn\).

Type: int

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type: float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type: float

out_feat#

Number of output features per time step \(\hOut\).

Type: int

forward(x: Tensor, h_0: Optional[Tensor] = None) → Tensor[source]#

Calculate batch of hidden states for x.

Below we describe the forward pass algorithm of Elman Net layer.

Let x be a batch of sequences of input features \(x\).
Let x.size(1) be sequence length \(S\).
Let h_0 be the initial hidden states \(h_0\). If h_0 is None, use self.h_0 instead.
Loop through \(\set{1, \dots, S}\) with looping index \(t\).
1. Use self.fc_x2h to transform the input features at the current time step \(x_t\).
2. Use self.fc_h2h to transform the recurrent units at the previous time step \(h_{t-1}\).
3. Add the two transformations and apply \(\tanh\) as the activation function to get the recurrent units at the current time step \(h_t\).
Denote the concatenation of hidden states \(h_1, \dots, h_S\) as \(h\).
Return \(h\).
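
The loop above can be sketched in plain Python for a single sequence. This is a minimal illustration with made-up numbers and hypothetical function names, not the library's code. It also shows that passing the last hidden state of one chunk as h_0 for the next chunk gives the same result as processing the whole sequence at once.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def elman_layer_forward(xs, h_0, W, U, b):
    """Return [h_1, ..., h_S] for one sequence xs = [x_1, ..., x_S]."""
    hs, h_prev = [], h_0
    for x_t in xs:
        wx = matvec(W, x_t)     # role of fc_x2h (bias b is added below)
        uh = matvec(U, h_prev)  # role of fc_h2h
        h_prev = [math.tanh(p + q + r) for p, q, r in zip(wx, uh, b)]
        hs.append(h_prev)
    return hs


W = [[0.5, -0.5]]  # shape (h_out = 1, h_in = 2)
U = [[0.1]]        # shape (h_out, h_out)
b = [0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_0 = [0.0]

whole = elman_layer_forward(xs, h_0, W, U, b)
# Process the same sequence in two chunks, carrying the hidden state over.
first = elman_layer_forward(xs[:2], h_0, W, U, b)
second = elman_layer_forward(xs[2:], first[-1], W, U, b)
```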

Parameters

x (torch.Tensor) – Batch of sequences of input features. x has shape \((B, S, \hIn)\) and dtype == torch.float.
h_0 (torch.Tensor, default: None) – Batch of previous hidden states. The tensor has shape \((B, \hOut)\) and dtype == torch.float. Set to None to use the initial hidden states self.h_0.

Returns

Batch of current hidden states \(h\). Returned tensor has shape \((B, S, \hOut)\) and dtype == torch.float.

Return type

torch.Tensor

params_init() → None[source]#

Initialize model parameters.

All weights and biases are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\).

Return type: None
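
As a minimal illustration of this initialization scheme, the sketch below draws every entry of a weight matrix from \(\mathcal{U}(\init_l, \init_u)\) using only the standard library. The helper name and the seed argument are hypothetical. In PyTorch the same effect is commonly achieved with torch.nn.init.uniform_, though this page does not state which call the library uses.

```python
import random


def uniform_init(shape, init_lower=-0.1, init_upper=0.1, seed=None):
    """Return a rows x cols matrix with entries drawn from U(init_lower, init_upper)."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    rows, cols = shape
    return [
        [rng.uniform(init_lower, init_upper) for _ in range(cols)]
        for _ in range(rows)
    ]


# Example: a (h_out = 4, h_in = 2) weight matrix with tighter bounds.
W = uniform_init((4, 2), init_lower=-0.01, init_upper=0.01, seed=0)
```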

[1] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990. URL: https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1402_1.