lmp.model._trans_enc#

Transformer language model.

class lmp.model._trans_enc.MultiHeadAttnLayer(*, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, n_head: int = 1, **kwargs: Any)[source]#

Bases: Module

Multi-head attention [1] layer.

  • Let \(B\) be input mini-batch size.

  • Let \(S_q\) be the length of each query sequence.

  • Let \(S_k\) be the length of each key sequence.

  • Let \(\dMdl\) be the number of features per time step in each sequence.

  • Let \(q\) be a batch of query sequence with shape \((B, S_q, \dMdl)\).

  • Let \(k\) be a batch of key sequence with shape \((B, S_k, \dMdl)\).

  • Let \(v\) be a batch of value sequence with shape \((B, S_k, \dMdl)\).

  • Let \(\msk\) be a batch of attention mask with shape \((B, S_q, S_k)\).

  • Let \(\nHd\) be the number of attention heads.

  • Let \(d_k\) be the number of key features in each attention head.

  • Let \(d_v\) be the number of value features in each attention head.

The multi-head attention layer is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\MultiHeadAttnLayer}(k, \msk, q, v) \\ & \indent{1} S_q \algoEq q.\sz{1} \\ & \indent{1} S_k \algoEq k.\sz{1} \\ & \indent{1} \algoFor{h \in \set{1, \dots, \nHd}} \\ & \indent{2} \algoCmt{Get query vector for each head.} \\ & \indent{2} \algoFor{t \in \set{1, \dots, S_q}} \\ & \indent{3} q_t^h \algoEq W_{Q, h} \cdot q_t \\ & \indent{2} \algoEndFor \\ & \indent{2} Q^h \algoEq \cat{q_1^h, \dots, q_{S_q}^h} \\ & \indent{2} \algoCmt{Get key-value vectors for each head.} \\ & \indent{2} \algoFor{t \in \set{1, \dots, S_k}} \\ & \indent{3} k_t^h \algoEq W_{K, h} \cdot k_t \\ & \indent{3} v_t^h \algoEq W_{V, h} \cdot v_t \\ & \indent{2} \algoEndFor \\ & \indent{2} K^h \algoEq \cat{k_1^h, \dots, k_{S_k}^h} \\ & \indent{2} V^h \algoEq \cat{v_1^h, \dots, v_{S_k}^h} \\ & \indent{2} \algoCmt{Apply attention mask on similarity scores.} \\ & \indent{2} \Sim^h \algoEq \dfrac{Q^h \cdot \pa{K^h}^\top}{\sqrt{d_k}} \\ & \indent{2} \algoFor{i \in \set{1, \dots, S_q}} \\ & \indent{3} \algoFor{j \in \set{1, \dots, S_k}} \\ & \indent{4} \algoIf{\msk_{i,j} \algoIs \algoTrue} \\ & \indent{5} \Sim_{i,j}^h \algoEq -10^9 \\ & \indent{4} \algoEndIf \\ & \indent{3} \algoEndFor \\ & \indent{3} \attn_i^h \algoEq \sof{\Sim_{i,1}^h, \dots, \Sim_{i,S_k}^h} \\ & \indent{2} \algoEndFor \\ & \indent{2} \algoCmt{Get attention scores.} \\ & \indent{2} \attn^h \algoEq \cat{\attn_1^h, \dots, \attn_{S_q}^h} \\ & \indent{2} F^h \algoEq \attn^h \cdot V^h \\ & \indent{1} \algoEndFor \\ & \indent{1} F \algoEq \fla{F^1, \dots, F^{\nHd}} \\ & \indent{1} O \algoEq W_O \cdot F \\ & \indent{1} \algoReturn O \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters

  • \(W_{K,h}\): \((d_k, \dMdl)\)

  • \(W_O\): \((\dMdl, \nHd \times d_v)\)

  • \(W_{Q,h}\): \((d_k, \dMdl)\)

  • \(W_{V,h}\): \((d_v, \dMdl)\)

Nodes

  • \(F\): \((B, S_q, \nHd \times d_v)\)

  • \(F^h\): \((B, S_q, d_v)\)

  • \(K^h\): \((B, S_k, d_k)\)

  • \(O\): \((B, S_q, \dMdl)\)

  • \(Q^h\): \((B, S_q, d_k)\)

  • \(V^h\): \((B, S_k, d_v)\)

  • \(\attn^h\): \((B, S_q, S_k)\)

  • \(\attn_i^h\): \((B, S_k)\)

  • \(k\): \((B, S_k, \dMdl)\)

  • \(k_t\): \((B, \dMdl)\)

  • \(k_t^h\): \((B, d_k)\)

  • \(\msk\): \((B, S_q, S_k)\)

  • \(\msk_{i,j}\): \((B)\)

  • \(q\): \((B, S_q, \dMdl)\)

  • \(q_t\): \((B, \dMdl)\)

  • \(q_t^h\): \((B, d_k)\)

  • \(\Sim^h\): \((B, S_q, S_k)\)

  • \(\Sim_{i,j}^h\): \((B)\)

  • \(v\): \((B, S_k, \dMdl)\)

  • \(v_t\): \((B, \dMdl)\)

  • \(v_t^h\): \((B, d_v)\)

Model parameters in the multi-head attention layer are initialized with the uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.
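In PyTorch terms the procedure above reduces to a handful of batched tensor operations. Below is a minimal sketch (an illustration, not the library's actual implementation) in which the per-head projections \(W_{Q,h}\), \(W_{K,h}\), \(W_{V,h}\) and the output projection \(W_O\) are packed into bias-free nn.Linear maps, mirroring the fc_ff_* attributes documented below:

import math

import torch
import torch.nn as nn


class MHASketch(nn.Module):
  """Minimal sketch of the multi-head attention procedure defined above."""

  def __init__(self, d_model: int, n_head: int, d_k: int, d_v: int):
    super().__init__()
    self.n_head, self.d_k, self.d_v = n_head, d_k, d_v
    # Per-head projections W_{Q,h}, W_{K,h}, W_{V,h} packed into single bias-free linear maps.
    self.w_q = nn.Linear(d_model, n_head * d_k, bias=False)
    self.w_k = nn.Linear(d_model, n_head * d_k, bias=False)
    self.w_v = nn.Linear(d_model, n_head * d_v, bias=False)
    # Output projection W_O.
    self.w_o = nn.Linear(n_head * d_v, d_model, bias=False)

  def forward(self, k: torch.Tensor, mask: torch.Tensor, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    B, S_q, S_k = q.size(0), q.size(1), k.size(1)
    # Project and split into heads: (B, n_head, S, d).
    Q = self.w_q(q).view(B, S_q, self.n_head, self.d_k).transpose(1, 2)
    K = self.w_k(k).view(B, S_k, self.n_head, self.d_k).transpose(1, 2)
    V = self.w_v(v).view(B, S_k, self.n_head, self.d_v).transpose(1, 2)
    # Scaled dot-product similarity scores: (B, n_head, S_q, S_k).
    sim = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
    # Masked positions receive a large negative score before softmax.
    sim = sim.masked_fill(mask.unsqueeze(1), -1e9)
    attn = sim.softmax(dim=-1)
    # Weighted sum over value heads, flatten the heads, then project back to d_model.
    F = (attn @ V).transpose(1, 2).reshape(B, S_q, self.n_head * self.d_v)
    return self.w_o(F)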

Parameters
  • d_k (int, default: 1) – Number of key features \(d_k\) in each head.

  • d_model (int, default: 1) – Number of input / output features \(\dMdl\).

  • d_v (int, default: 1) – Number of value features \(d_v\) in each head.

  • init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

  • init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

  • n_head (int, default: 1) – Number of attention heads \(\nHd\).

d_k#

Number of key features \(d_k\) in each head.

Type

int

d_model#

Number of input / output features \(\dMdl\).

Type

int

d_v#

Number of value features \(d_v\) in each head.

Type

int

fc_ff_f2o#

Fully connected feed-forward layer \(W_O\) which transforms features to outputs. No biases are used. Input shape: \((B, S_q, \nHd \times d_v)\). Output shape: \((B, S_q, \dMdl)\).

Type

torch.nn.Linear

fc_ff_k2hk#

Fully connected feed-forward layer \(\pa{W_{K,1}, \dots, W_{K,\nHd}}\) which transforms key vectors to heads. No biases are used. Input shape: \((B, S_k, \dMdl)\). Output shape: \((B, S_k, \nHd \times d_k)\).

Type

torch.nn.Linear

fc_ff_q2hq#

Fully connected feed-forward layer \(\pa{W_{Q,1}, \dots, W_{Q,\nHd}}\) which transforms query vectors to heads. No biases are used. Input shape: \((B, S_q, \dMdl)\). Output shape: \((B, S_q, \nHd \times d_k)\).

Type

torch.nn.Linear

fc_ff_v2hv#

Fully connected feed-forward layer \(\pa{W_{V,1}, \dots, W_{V,\nHd}}\) which transforms value vectors to heads. No biases are used. Input shape: \((B, S_k, \dMdl)\). Output shape: \((B, S_k, \nHd \times d_v)\).

Type

torch.nn.Linear

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type

float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type

float

n_head#

Number of attention heads \(\nHd\).

Type

int

scaler#

Dot product scaler \(\dfrac{1}{\sqrt{d_k}}\).

Type

float

forward(k: Tensor, mask: Tensor, q: Tensor, v: Tensor) Tensor[source]#

Perform multi-head attention on query, key, value.

Below we describe the forward pass algorithm of multi-head attention layer.

  1. Let q be a batch of sequences of query vectors \(q\).

  2. Let k be a batch of sequences of key vectors \(k\).

  3. Let v be a batch of sequences of value vectors \(v\).

  4. Let mask be a batch of attention mask \(\msk\).

  5. Let q.size(1) be sequence length \(S_q\).

  6. Let k.size(1) be sequence length \(S_k\).

  7. Use self.fc_ff_q2hq to transform query vectors into multi-head query vectors \(Q^1, \dots, Q^{\nHd}\).

  8. Use self.fc_ff_k2hk to transform key vectors into multi-head key vectors \(K^1, \dots, K^{\nHd}\).

  9. Use self.fc_ff_v2hv to transform value vectors into multi-head value vectors \(V^1, \dots, V^{\nHd}\).

  10. Use \(Q^1, \dots, Q^{\nHd}\) and \(K^1, \dots, K^{\nHd}\) to calculate similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\).

  11. Use mask to mask similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\).

  12. Use softmax to transform similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\) into attention scores \(\attn^1, \dots, \attn^{\nHd}\).

  13. Use attention scores \(\attn^1, \dots, \attn^{\nHd}\) and \(V^1, \dots, V^{\nHd}\) to calculate hidden features \(F^1, \dots, F^{\nHd}\).

  14. Use \(W_O\) and hidden features \(F^1, \dots, F^{\nHd}\) to calculate output \(O\).

  15. Return \(O\).
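Example usage (a sketch assuming the constructor and forward signatures behave exactly as documented on this page):

>>> import torch
>>> from lmp.model._trans_enc import MultiHeadAttnLayer
>>> mha = MultiHeadAttnLayer(d_k=4, d_model=6, d_v=8, n_head=2)
>>> B, S_q, S_k = 2, 5, 7
>>> q = torch.rand(B, S_q, 6)
>>> k = torch.rand(B, S_k, 6)
>>> v = torch.rand(B, S_k, 6)
>>> mask = torch.zeros(B, S_q, S_k, dtype=torch.bool)
>>> out = mha(k=k, mask=mask, q=q, v=v)
>>> assert out.shape == (B, S_q, 6)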

Parameters
  • k (torch.Tensor) – Batch of sequences of key vectors with shape \((B, S_k, \dMdl)\) and dtype == torch.float.

  • mask (torch.Tensor) – Batch of attention mask with shape \((B, S_q, S_k)\) and dtype == torch.bool. Set to true to mask attention at corresponding position.

  • q (torch.Tensor) – Batch of sequences of query vectors with shape \((B, S_q, \dMdl)\) and dtype == torch.float.

  • v (torch.Tensor) – Batch of sequences of value vectors with shape \((B, S_k, \dMdl)\) and dtype == torch.float.

Returns

Batch of output features \(O\) with shape \((B, S_q, \dMdl)\) and dtype == torch.float.

Return type

torch.Tensor

params_init() None[source]#

Initialize model parameters.

All weights are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\).

Return type

None

class lmp.model._trans_enc.PosEncLayer(*, d_emb: int = 1, max_seq_len: int = 512, **kwargs: Any)[source]#

Bases: Module

Positional Encoding [1].

  • Let \(S\) be the lookup sequence length.

  • Let \(\dEmb\) be the dimension of positional encodings.

Positional encodings are defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\PosEncLayer}\pa{S} \\ & \indent{1} \algoFor{\pos \in \set{1, \dots, S}} \\ & \indent{2} \algoFor{i \in \set{1, \dots, \dEmb}} \\ & \indent{3} \algoIf{i \text{ is even}} \\ & \indent{4} \PE_{(\pos,i)} \algoEq \sin\pa{\dfrac{\pos}{10000^{i / \dEmb}}} \\ & \indent{3} \algoElse \\ & \indent{4} \PE_{(\pos,i)} \algoEq \cos\pa{\dfrac{\pos}{10000^{i / \dEmb}}} \\ & \indent{3} \algoEndIf \\ & \indent{2} \algoEndFor \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoReturn \PE \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters

  • None. Positional encodings are fixed (see params_init).

Nodes

  • \(\PE_{(\pos,i)}\): \((1)\)

  • \(\PE\): \((1, S, \dEmb)\)
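The lookup table \(\PE\) can be precomputed once from the rule above. A minimal sketch follows; it illustrates the formula as written (1-indexed dimensions, even dimensions use sine) and is not necessarily how the pe buffer is built internally:

import torch


def build_pos_enc(max_seq_len: int, d_emb: int) -> torch.Tensor:
  """Precompute sinusoidal positional encodings with shape (1, max_seq_len, d_emb)."""
  # Positions 0 .. max_seq_len - 1 (the forward pass looks up positions starting from 0).
  pos = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)       # (S, 1)
  # Dimension index i = 1 .. d_emb, matching the 1-indexed rule in the procedure above.
  i = torch.arange(1, d_emb + 1, dtype=torch.float).unsqueeze(0)        # (1, d_emb)
  angle = pos / torch.pow(10000.0, i / d_emb)                           # (S, d_emb)
  # Even dimensions use sine, odd dimensions use cosine.
  pe = torch.where(i.long() % 2 == 0, torch.sin(angle), torch.cos(angle))
  return pe.unsqueeze(0)                                                # (1, S, d_emb)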

Parameters
  • d_emb (int, default: 1) – Positional encoding dimension \(\dEmb\).

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

  • max_seq_len (int, default: 512) – Maximum length constraint on the input sequence.

d_emb#

Positional encoding dimension \(\dEmb\).

Type

int

max_seq_len#

Maximum length constraint on the input sequence.

Type

int

pe#

Positional encoding lookup table.

Type

torch.Tensor

forward(seq_len: int) Tensor[source]#

Lookup positional encodings.

Lookup starts at position 0 and ends at position seq_len - 1 (inclusive).

Parameters

seq_len (int) – Sequence length \(S\).

Returns

Positional encodings with shape \((1, S, \dEmb)\) and dtype == torch.float.

Return type

torch.Tensor
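Example usage (a sketch assuming the layer behaves as documented):

>>> from lmp.model._trans_enc import PosEncLayer
>>> pos_enc = PosEncLayer(d_emb=4, max_seq_len=32)
>>> pe = pos_enc(seq_len=10)
>>> assert pe.shape == (1, 10, 4)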

params_init() None[source]#

Do nothing.

Return type

None

class lmp.model._trans_enc.TransEnc(*, d_ff: int = 1, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, label_smoothing: float = 0.0, max_seq_len: int = 512, n_head: int = 1, n_lyr: int = 1, p: float = 0.0, tknzr: BaseTknzr, **kwargs: Any)[source]#

Bases: BaseModel

Transformer encoder [1] language model.

  • Let \(x\) be a batch of token ids with batch size \(B\) and sequence length \(S\).

  • Let \(c\) be previous batch of token ids (previous context window) with shape \((B, S')\). Note that \(c\) can be empty.

  • Let \(S_\max\) be the maximum sequence length a model can deal with.

    • When \(c\) is empty, the constraint \(S \leq S_\max\) must be satisfied.

    • When \(c\) is not empty, the constraint \(S + S' \leq S_\max\) must be satisfied.

  • Let \(V\) be the vocabulary size of the paired tokenizer. Each token id represents a unique token, i.e., \(x_t \in \set{1, \dots, V}\).

  • Let \(E\) be the token embedding lookup table.

    • Let \(\dMdl\) be the dimension of token embeddings.

    • Let \(e_t\) be the token embedding corresponding to token id \(x_t\).

    • Token embeddings have dropout probability \(p\).

  • Let \(\PE\) be positional encoding layer.

    • Let \(\PE_t\) be the positional encoding at the \(t\) th position.

    • The dimension of positional encodings is \(\dMdl\).

  • Let \(\nLyr\) be the number of transformer encoder layers.

  • Let \(h^\ell\) be the output of the \(\ell\) th transformer encoder layer.

The Transformer encoder language model is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\TransEnc}\pa{x, c} \\ & \indent{1} \algoIf{c \text{ is not empty}} \\ & \indent{2} x \algoEq \cat{c, x} \\ & \indent{1} \algoEndIf \\ & \indent{1} S \algoEq x.\sz{1} \\ & \indent{1} \algoCmt{Create attention mask.} \\ & \indent{1} \algoFor{i \in \set{1, \dots, S}} \\ & \indent{2} \algoFor{j \in \set{1, \dots, S}} \\ & \indent{3} \algoIf{x_i \algoIs \text{padding}} \\ & \indent{4} \msk_{i,j} \algoEq \algoTrue \\ & \indent{3} \algoElseIf{j \leq i} \\ & \indent{4} \msk_{i,j} \algoEq \algoFalse \\ & \indent{3} \algoElse \\ & \indent{4} \msk_{i,j} \algoEq \algoTrue \\ & \indent{3} \algoEndIf \\ & \indent{2} \algoEndFor \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoCmt{Lookup token embedding and positional encoding.} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} e_t \algoEq (x_t)\text{-th row of } E \text{ but treated as column vector} \\ & \indent{2} h_t^0 \algoEq \drop{e_t + \PE_t}{p} \\ & \indent{1} \algoEndFor \\ & \indent{1} h^0 \algoEq \cat{h_1^0, \dots, h_S^0} \\ & \indent{1} \algoCmt{Perform forward pass on stacked Transformer encoder layers} \\ & \indent{1} \algoFor{\ell \in \set{1, \dots, \nLyr}} \\ & \indent{2} h^\ell \algoEq \TransEncLayer\pa{ k \algoEq h^{\ell-1}, \msk \algoEq \msk, q \algoEq h^{\ell-1}, v \algoEq h^{\ell-1} } \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} y_t \algoEq \sof{E \cdot h_t^{\nLyr}} \\ & \indent{1} \algoEndFor \\ & \indent{1} y \algoEq \cat{y_1, \dots, y_S} \\ & \indent{1} c' \algoEq \cat{x_{\max\pa{1, S - (S_\max-2)}}, \dots, x_S} \\ & \indent{1} \algoReturn \pa{y, c'} \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters

  • \(E\): \((V, \dMdl)\)

  • \(\TransEncLayer\)

Nodes

  • \(\PE\): \((B, S_\max, \dMdl)\)

  • \(\PE_t\): \((B, \dMdl)\)

  • \(c\): \((B, S')\)

  • \(c'\): \((B, S_\max-1)\)

  • \(e_t\): \((B, S, \dMdl)\)

  • \(h^\ell\): \((B, S, \dMdl)\)

  • \(h_t^0\): \((B, \dMdl)\)

  • \(\msk\): \((B, S, S)\)

  • \(\msk_{i,j}\): \((B)\)

  • \(x\): \((B, S)\)

  • \(x_t\): \((B)\)

  • \(y\): \((B, S, V)\)

  • \(y_t\): \((B, V)\)

The goal of optimization is to minimize the negative log-likelihood of the next token id \(x_{t+1}\) given \(y_t\). The prediction loss is defined to be the average negative log-likelihood over \(x\) given \(y\).

\[\loss = \dfrac{-1}{S} \sum_{t = 1}^S \log \Pr(x_{t+1} \vert y_t).\]
  • \(y_t\) is the next token id prediction probability distribution over tokenizer’s vocabulary. We use inner product to calculate similarity scores over all token ids, and then use softmax to normalize similarity scores into probability range \([0, 1]\).

  • Model parameters in Transformer encoder language model are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) of uniform distribution are given as hyperparameters.
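In code, the loss above is ordinary token-level cross entropy over the logits \(E \cdot h_t^{\nLyr}\). A hedged sketch with made-up tensors (not the library's cal_loss implementation):

import torch
import torch.nn as nn

B, S, V = 2, 8, 100
logits = torch.randn(B, S, V)                 # unnormalized scores, i.e. y before softmax
next_tkids = torch.randint(0, V, (B, S))      # prediction targets x_{t+1}

# Cross entropy averages the negative log-likelihood over all B * S positions;
# label_smoothing corresponds to the hyperparameter documented below.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = loss_fn(logits.reshape(B * S, V), next_tkids.reshape(B * S))   # scalar tensor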

Parameters
  • d_ff (int, default: 1) – Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.

  • d_k (int, default: 1) – Number of key features \(d_k\) in each head.

  • d_model (int, default: 1) – Number of input / output features \(\dMdl\).

  • d_v (int, default: 1) – Number of value features \(d_v\) in each head.

  • init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

  • init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

  • label_smoothing (float, default: 0.0) – Smoothing applied on prediction target \(x_{t+1}\).

  • max_seq_len (int, default: 512) – Maximum length of the input sequence.

  • n_lyr (int, default: 1) – Number of Transformer encoder layers \(\nLyr\).

  • n_head (int, default: 1) – Number of attention heads \(\nHd\).

  • p (float, default: 0.0) – Dropout probability \(p\).

  • tknzr (BaseTknzr) – Tokenizer instance.

d_ff#

Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.

Type

int

d_k#

Number of key features \(d_k\) in each head.

Type

int

d_model#

Number of input / output features \(\dMdl\).

Type

int

d_v#

Number of value features \(d_v\) in each head.

Type

int

emb#

Token embedding lookup matrix. Use token ids to lookup token embeddings.

Type

torch.nn.Embedding

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type

float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type

float

input_dp#

Dropout with probability \(p\) applied on the sum of token embeddings and position encodings.

Type

torch.nn.Dropout

label_smoothing#

Smoothing applied on prediction target \(x_{t+1}\).

Type

float

loss_fn#

Loss function to be optimized.

Type

torch.nn.CrossEntropyLoss

model_name#

CLI name of Transformer encoder is Transformer-encoder.

Type

ClassVar[str]

p#

Dropout probability \(p\).

Type

float

pos_enc#

Positional Encoding.

Type

lmp.model.PosEncLayer

stack_trans_enc#

Stacked TransEncLayer layers. The number of stacked layers equals \(\nLyr\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

torch.nn.ModuleList

classmethod add_CLI_args(parser: ArgumentParser) None[source]#

Add transformer encoder language model hyperparameters to CLI arguments parser.

Parameters

parser (argparse.ArgumentParser) – CLI argument parser.

Return type

None

See also

lmp.script.train_model

Language model training script.

Examples

>>> import argparse
>>> import math
>>> from lmp.model import TransEnc
>>> parser = argparse.ArgumentParser()
>>> TransEnc.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--d_ff', '2',
...   '--d_k', '4',
...   '--d_model', '6',
...   '--d_v', '8',
...   '--init_lower', '-0.01',
...   '--init_upper', '0.01',
...   '--label_smoothing', '0.1',
...   '--n_head', '10',
...   '--n_lyr', '2',
...   '--p', '0.1',
... ])
>>> assert args.d_ff == 2
>>> assert args.d_k == 4
>>> assert args.d_model == 6
>>> assert args.d_v == 8
>>> assert math.isclose(args.init_lower, -0.01)
>>> assert math.isclose(args.init_upper, 0.01)
>>> assert math.isclose(args.label_smoothing, 0.1)
>>> assert args.n_head == 10
>>> assert args.n_lyr == 2
>>> assert math.isclose(args.p, 0.1)

cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor][source]#

Calculate language model prediction loss.

We use cross entropy loss as our training objective. This method is only used for training.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.

Returns

The first tensor in the tuple is the mini-batch cross-entropy loss. Loss tensor has shape \((1)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in forward pass (we denoted it as \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.

Return type

tuple[torch.Tensor, torch.Tensor]

create_mask(batch_tkids: Tensor) Tensor[source]#

Create self-attention mask for batch_tkids.

The self-attention mask is created as follows:

  1. Create an auto-regressive mask by masking everything above the diagonal. This is needed since the input token at each time step can only see input tokens at previous time steps and itself.

  2. Create a padding mask by masking every position that corresponds to a padding token. This is needed since paddings are meaningless. (A sketch of both steps follows this list.)
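A hedged sketch of both steps with torch.triu; pad_tkid is a hypothetical stand-in for the paired tokenizer's padding token id:

import torch


def build_self_attn_mask(batch_tkids: torch.Tensor, pad_tkid: int) -> torch.Tensor:
  """Return a (B, S, S) boolean mask; True marks positions that must not be attended."""
  B, S = batch_tkids.shape
  # Step 1: auto-regressive mask, True strictly above the diagonal (future positions).
  causal = torch.triu(torch.ones(S, S), diagonal=1).bool()              # (S, S)
  # Step 2: padding mask, row i is fully masked when token i is a padding token
  # (this mirrors the pseudocode in the class description).
  pad = (batch_tkids == pad_tkid).unsqueeze(2)                          # (B, S, 1)
  return causal.unsqueeze(0) | pad                                      # (B, S, S)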

Parameters

batch_tkids (torch.Tensor) – Batch of token ids with shape (B, S) and dtype == torch.long.

Returns

Auto-regressive self attention mask and padding mask. Returned tensor has shape (B, S, S) and dtype == torch.bool.

Return type

torch.Tensor

forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor][source]#

Calculate next token id logits.

Logits are calculated from the previous context and the current input token ids. Use pred to convert logits into the next token id probability distribution over the tokenizer's vocabulary. Use cal_loss to convert logits into the next token id prediction loss. Below we describe the forward pass algorithm of the Transformer encoder language model; a code outline follows the list.

  1. Use token ids to lookup token embeddings with self.emb.

  2. Use sequence length to lookup positional encodings with self.pos_enc.

  3. Apply dropout to the sum of token embeddings and positional encodings.

  4. Feed the result into the first Transformer encoder layer. We use teacher forcing in this step when performing training, i.e., inputs are directly given instead of being generated by the model.

  5. Feed the output of each Transformer encoder layer into the next Transformer encoder layer until all layers have been used once.

  6. Perform inner product on the output of the last transformer encoder layer and token embeddings to get similarity scores.

  7. Return similarity scores (logits).
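A rough outline of steps 1-7 in terms of the attributes documented above (a sketch, not the actual method body; the previous-context concatenation and the truncated id tensor returned by the real method are omitted):

import torch
from lmp.model import TransEnc


def forward_outline(model: TransEnc, batch_cur_tkids: torch.Tensor) -> torch.Tensor:
  """Return next-token logits with shape (B, S, V)."""
  mask = model.create_mask(batch_tkids=batch_cur_tkids)                 # (B, S, S), bool
  # Token embeddings plus positional encodings, with dropout (steps 1-3).
  e = model.emb(batch_cur_tkids)                                        # (B, S, d_model)
  h = model.input_dp(e + model.pos_enc(seq_len=batch_cur_tkids.size(1)))
  # Stacked Transformer encoder layers (steps 4-5).
  for lyr in model.stack_trans_enc:
    h = lyr(mask=mask, x=h)
  # Inner product with the embedding matrix gives similarity scores / logits (steps 6-7).
  return h @ model.emb.weight.transpose(0, 1)                           # (B, S, V)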

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.

Returns

The first tensor in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in forward pass (we denoted it as \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.

Return type

tuple[torch.Tensor, torch.Tensor]

params_init() None[source]#

Initialize model parameters.

All weights and biases are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\).

Return type

None

See also

params_init

Transformer encoder layer parameter initialization.

pred(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor][source]#

Calculate next token id probability distribution over tokenizer’s vocabulary.

Probabilities are calculated from the previous context and the current input token ids. This method must only be used for inference. No tensor graph is constructed and no gradients are calculated.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.

Returns

The first tensor in the tuple is the batch of next token id probability distribution over the paired tokenizer’s vocabulary. Probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in forward pass (we denoted it as \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.

Return type

tuple[torch.Tensor, torch.Tensor]
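One plausible way to drive incremental decoding with pred is sketched below; model is assumed to be an already constructed (and trained) TransEnc instance and tkids a batch of prompt token ids with shape (B, S), and the greedy sampling rule is only illustrative:

# First call: no previous context.
probs, prev_tkids = model.pred(batch_cur_tkids=tkids, batch_prev_states=None)
# Greedily pick the next token id from the distribution at the last position.
next_tkid = probs[:, -1, :].argmax(dim=-1, keepdim=True)                # (B, 1)
# Subsequent calls: feed only the new token and pass the returned ids as context.
probs, prev_tkids = model.pred(batch_cur_tkids=next_tkid, batch_prev_states=prev_tkids)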

class lmp.model._trans_enc.TransEncLayer(*, d_ff: int = 1, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, n_head: int = 1, p: float = 0.0, **kwargs: Any)[source]#

Bases: Module

Transformer encoder layer [1].

  • Let \(B\) be mini-batch size.

  • Let \(S\) be the length of each sequence in a mini-batch.

  • Let \(\dMdl\) be the number of features per time step in each sequence.

  • Let \(x\) be a batch of sequences of features with shape \((B, S, \dMdl)\).

  • Let \(\msk\) be a batch of attention mask with shape \((B, S, S)\).

  • Let \(\nHd\) be the number of attention heads.

  • Let \(d_k\) be the number of key features in each attention head.

  • Let \(d_v\) be the number of value features in each attention head.

  • Let \(\dFf\) be the number of hidden units in the 2-layer fully connected feed-forward network.

  • Let \(p\) be the dropout probability.

The Transformer encoder layer is defined as follows:

\[\begin{split}\begin{align*} & \algoProc{\TransEncLayer}(\msk, x) \\ & \indent{1} y_1 \algoEq \MultiHeadAttnLayer\pa{k \algoEq x, \msk \algoEq \msk, q \algoEq x, v \algoEq x} \\ & \indent{1} y_2 \algoEq \LayerNorm_1\pa{x + \drop{y_1}{p}} \\ & \indent{1} y_3 \algoEq W_2 \cdot \max\pa{\mathbf{0}, W_1 \cdot y_2 + b_1} + b_2 \\ & \indent{1} y_4 \algoEq \LayerNorm_2\pa{y_2 + \drop{y_3}{p}} \\ & \indent{1} \algoReturn y_4 \\ & \algoEndProc \end{align*}\end{split}\]

Trainable Parameters

  • \(W_1\): \((\dFf, \dMdl)\)

  • \(W_2\): \((\dMdl, \dFf)\)

  • \(b_1\): \((\dFf)\)

  • \(b_2\): \((\dMdl)\)

  • \(\MultiHeadAttnLayer\)

  • \(\LayerNorm_1\)

  • \(\LayerNorm_2\)

Nodes

  • \(\mathbf{0}\): \((B, S, \dFf)\)

  • \(\msk\): \((B, S, S)\)

  • \(x\): \((B, S, \dMdl)\)

  • \(y_1\): \((B, S, \dMdl)\)

  • \(y_2\): \((B, S, \dMdl)\)

  • \(y_3\): \((B, S, \dMdl)\)

  • \(y_4\): \((B, S, \dMdl)\)

Model parameters in the Transformer encoder layer are initialized with the uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.
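The procedure above is the standard post-norm encoder wiring (attention, residual, LayerNorm, feed-forward, residual, LayerNorm). A minimal self-contained sketch, independent of the library's actual module layout, where attn can be any module with the forward signature (k, mask, q, v) such as MultiHeadAttnLayer:

import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
  """Post-norm encoder block mirroring the procedure above."""

  def __init__(self, attn: nn.Module, d_model: int, d_ff: int, p: float):
    super().__init__()
    self.attn = attn                                   # multi-head attention, e.g. MultiHeadAttnLayer
    self.dp = nn.Dropout(p)
    # y_3 = W_2 * max(0, W_1 * y_2 + b_1) + b_2, with dropout on the output.
    self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model), nn.Dropout(p))
    self.ln_1 = nn.LayerNorm(d_model)
    self.ln_2 = nn.LayerNorm(d_model)

  def forward(self, mask: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    y1 = self.attn(k=x, mask=mask, q=x, v=x)           # self-attention: identical q, k, v
    y2 = self.ln_1(x + self.dp(y1))                    # residual + LayerNorm_1
    y3 = self.ffn(y2)                                  # position-wise feed-forward
    return self.ln_2(y2 + y3)                          # residual + LayerNorm_2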

Parameters
  • d_ff (int, default: 1) – Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.

  • d_k (int, default: 1) – Number of key features \(d_k\) in each head.

  • d_model (int, default: 1) – Number of input / output features \(\dMdl\).

  • d_v (int, default: 1) – Number of value features \(d_v\) in each head.

  • init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

  • init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

  • n_head (int, default: 1) – Number of attention heads \(\nHd\).

  • p (float, default: 0.0) – Dropout probability \(p\).

d_ff#

Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.

Type

int

d_k#

Number of key features \(d_k\) in each head.

Type

int

d_model#

Number of input / output features \(\dMdl\).

Type

int

d_v#

Number of value features \(d_v\) in each head.

Type

int

ffn#

2-layer fully connected feed-forward network with parameters \(W_1, W_2, b_1, b_2\). Dropout with probability \(p\) is applied to output. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

torch.nn.Sequential

init_lower#

Uniform distribution lower bound \(\init_l\) used to initialize model parameters.

Type

float

init_upper#

Uniform distribution upper bound \(\init_u\) used to initialize model parameters.

Type

float

ln_1#

Corresponds to \(\LayerNorm_1\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

torch.nn.LayerNorm

ln_2#

Corresponds to \(\LayerNorm_2\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

torch.nn.LayerNorm

mha#

Multi-head self attention layer. Multi-head attention is calculated through \(\MultiHeadAttnLayer\), and self-attention is achieved by giving identical inputs to query, key, and value. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

MultiHeadAttnLayer

mha_dp#

Perform dropout with probability \(p\) on the output of multi-head self attention. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).

Type

torch.nn.Dropout

n_head#

Number of attention heads \(\nHd\).

Type

int

p#

Dropout probability \(p\).

Type

float

See also

MultiHeadAttnLayer

Multi-head attention layer.

forward(mask: Tensor, x: Tensor) Tensor[source]#

Calculate batch of hidden states for x.

Below we describe the forward pass algorithm of transformer encoder layer.

  1. Let x be a batch of sequences of features \(x\).

  2. Let mask be a batch of attention mask \(\msk\).

  3. Use self.mha to perform multi-head self attention on x and get \(y_1\).

  4. Use self.mha_dp to perform dropout on \(y_1\).

  5. Add \(x\) and \(y_1\) (with dropout applied) and use self.ln_1 to perform layer normalization on the addition result to get \(y_2\).

  6. Use self.ffn to perform 2-layer fully connected feed-forward network forward pass and get \(y_3\).

  7. Add \(y_2\) and \(y_3\) (with dropout applied) and use self.ln_2 to perform layer normalization on the addition result to get \(y_4\).

  8. Return \(y_4\).

Parameters
  • x (torch.Tensor) – Batch of sequences of features with shape \((B, S, \dMdl)\) and dtype == torch.float.

  • mask (torch.Tensor) – Batch of attention mask with shape \((B, S, S)\) and dtype == torch.bool. Set to true to mask attention at corresponding position.

Returns

Batch of sequences of output features \(y_4\) with shape \((B, S, \dMdl)\) and dtype == torch.float.

Return type

torch.Tensor

params_init() None[source]#

Initialize model parameters.

All weights and biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\).

Return type

None

See also

params_init

Multi-head attention layer parameter initialization.

[1]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.