Transformer encoder#
- class lmp.model.MultiHeadAttnLayer(*, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, n_head: int = 1, **kwargs: Any)[source]#
Bases:
Module
Multi-head attention [1] layer.
Let \(B\) be input mini-batch size.
Let \(S_q\) be the length of each query sequence.
Let \(S_k\) be the length of each key sequence.
Let \(\dMdl\) be the number of features per time step in each sequence.
Let \(q\) be a batch of query sequence with shape \((B, S_q, \dMdl)\).
Let \(k\) be a batch of key sequence with shape \((B, S_k, \dMdl)\).
Let \(v\) be a batch of value sequence with shape \((B, S_k, \dMdl)\).
Let \(\msk\) be a batch of attention mask with shape \((B, S_q, S_k)\).
Let \(\nHd\) be the number of attention heads.
Let \(d_k\) be the number of key features in each attention head.
Let \(d_v\) be the number of value features in each attention head.
Multi-head attention layer is defined as follows:
\[\begin{split}\begin{align*} & \algoProc{\MultiHeadAttnLayer}(k, \msk, q, v) \\ & \indent{1} S_q \algoEq q.\sz{1} \\ & \indent{1} S_k \algoEq k.\sz{1} \\ & \indent{1} \algoFor{h \in \set{1, \dots, \nHd}} \\ & \indent{2} \algoCmt{Get query vector for each head.} \\ & \indent{2} \algoFor{t \in \set{1, \dots, S_q}} \\ & \indent{3} q_t^h \algoEq W_{Q, h} \cdot q_t \\ & \indent{2} \algoEndFor \\ & \indent{2} Q^h \algoEq \cat{q_1^h, \dots, q_{S_q}^h} \\ & \indent{2} \algoCmt{Get key-value vectors for each head.} \\ & \indent{2} \algoFor{t \in \set{1, \dots, S_k}} \\ & \indent{3} k_t^h \algoEq W_{K, h} \cdot k_t \\ & \indent{3} v_t^h \algoEq W_{V, h} \cdot v_t \\ & \indent{2} \algoEndFor \\ & \indent{2} K^h \algoEq \cat{k_1^h, \dots, k_{S_k}^h} \\ & \indent{2} V^h \algoEq \cat{v_1^h, \dots, v_{S_k}^h} \\ & \indent{2} \algoCmt{Apply attention mask on similarity scores.} \\ & \indent{2} \Sim^h \algoEq \dfrac{Q^h \cdot \pa{K^h}^\top}{\sqrt{d_k}} \\ & \indent{2} \algoFor{i \in \set{1, \dots, S_q}} \\ & \indent{3} \algoFor{j \in \set{1, \dots, S_k}} \\ & \indent{4} \algoIf{\msk_{i,j} \algoIs \algoTrue} \\ & \indent{5} \Sim_{i,j}^h \algoEq -10^9 \\ & \indent{4} \algoEndIf \\ & \indent{3} \algoEndFor \\ & \indent{3} \attn_i^h \algoEq \sof{\Sim_{i,1}^h, \dots, \Sim_{i,S_k}^h} \\ & \indent{2} \algoEndFor \\ & \indent{2} \algoCmt{Get attention scores.} \\ & \indent{2} \attn^h \algoEq \cat{\attn_1^h, \dots, \attn_{S_q}^h} \\ & \indent{2} F^h \algoEq \attn^h \cdot V^h \\ & \indent{1} \algoEndFor \\ & \indent{1} F \algoEq \fla{F^1, \dots, F^{\nHd}} \\ & \indent{1} O \algoEq W_O \cdot F \\ & \indent{1} \algoReturn O \\ & \algoEndProc \end{align*}\end{split}\]Trainable Parameters
- \(W_{K,h}\): \((d_k, \dMdl)\)
- \(W_O\): \((\dMdl, \nHd \times d_v)\)
- \(W_{Q,h}\): \((d_k, \dMdl)\)
- \(W_{V,h}\): \((d_v, \dMdl)\)
Nodes
- \(F\): \((B, S_q, \nHd \times d_v)\)
- \(F^h\): \((B, S_q, d_v)\)
- \(K^h\): \((B, S_k, d_k)\)
- \(O\): \((B, S_q, \dMdl)\)
- \(Q^h\): \((B, S_q, d_k)\)
- \(V^h\): \((B, S_k, d_v)\)
- \(\attn^h\): \((B, S_q, S_k)\)
- \(\attn_i^h\): \((B, S_k)\)
- \(k\): \((B, S_k, \dMdl)\)
- \(k_t\): \((B, \dMdl)\)
- \(k_t^h\): \((B, d_k)\)
- \(\msk\): \((B, S_q, S_k)\)
- \(\msk_{i,j}\): \((B)\)
- \(q\): \((B, S_q, \dMdl)\)
- \(q_t\): \((B, \dMdl)\)
- \(q_t^h\): \((B, d_k)\)
- \(\Sim^h\): \((B, S_q, S_k)\)
- \(\Sim_{i,j}^h\): \((B)\)
- \(v\): \((B, S_k, \dMdl)\)
- \(v_t\): \((B, \dMdl)\)
- \(v_t^h\): \((B, d_v)\)
Model parameters in Multi-head attention layer are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.
- Parameters
d_k (int, default: 1) – Number of key features \(d_k\) in each head.
d_model (int, default: 1) – Number of input / output features \(\dMdl\).
d_v (int, default: 1) – Number of value features \(d_v\) in each head.
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
n_head (int, default: 1) – Number of attention heads \(\nHd\).
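A minimal usage sketch of the constructor and forward signatures documented on this page (hyperparameter values and tensor contents are hypothetical):

    import torch
    from lmp.model import MultiHeadAttnLayer

    # Hypothetical sizes: B=2, S_q=3, S_k=5, d_model=6, n_head=2, d_k=d_v=4.
    layer = MultiHeadAttnLayer(d_k=4, d_model=6, d_v=4, init_lower=-0.1, init_upper=0.1, n_head=2)
    q = torch.rand(2, 3, 6)                        # (B, S_q, d_model)
    k = torch.rand(2, 5, 6)                        # (B, S_k, d_model)
    v = torch.rand(2, 5, 6)                        # (B, S_k, d_model)
    mask = torch.zeros(2, 3, 5, dtype=torch.bool)  # True marks masked positions.
    out = layer(k=k, mask=mask, q=q, v=v)          # (B, S_q, d_model)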
- fc_ff_f2o#
Fully connected feed-forward layer \(W_O\) which transforms features to output. No biases are used. Input shape: \((B, S_q, \nHd \times d_v)\). Output shape: \((B, S_q, \dMdl)\).
- Type
- fc_ff_k2hk#
Fully connected feed-forward layer \(\pa{W_{K,1}, \dots, W_{K,\nHd}}\) which transforms key vectors to heads. No biases are used. Input shape: \((B, S_k, \dMdl)\). Output shape: \((B, S_k, \nHd \times d_k)\).
- Type
- fc_ff_q2hq#
Fully connected feed-forward layer \(\pa{W_{Q,1}, \dots, W_{Q,\nHd}}\) which transforms query vectors to heads. No biases are used. Input shape: \((B, S_q, \dMdl)\). Output shape: \((B, S_q, \nHd \times d_k)\).
- Type
- fc_ff_v2hv#
Fully connected feed-forward layer \(\pa{W_{V,1}, \dots, W_{V,\nHd}}\) which transforms value vectors to heads. No biases are used. Input shape: \((B, S_k, \dMdl)\). Output shape: \((B, S_k, \nHd \times d_v)\).
- Type
- init_lower#
Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
- Type
float
- init_upper#
Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
- Type
float
- forward(k: Tensor, mask: Tensor, q: Tensor, v: Tensor) Tensor [source]#
Perform multi-head attention on query, key, value.
Below we describe the forward pass algorithm of multi-head attention layer.
- Let q be a batch of sequences of query vectors \(q\).
- Let k be a batch of sequences of key vectors \(k\).
- Let v be a batch of sequences of value vectors \(v\).
- Let mask be a batch of attention mask \(\msk\).
- Let q.size(1) be the sequence length \(S_q\).
- Let k.size(1) be the sequence length \(S_k\).
- Use self.fc_ff_q2hq to transform query vectors into multi-head query vectors \(Q^1, \dots, Q^{\nHd}\).
- Use self.fc_ff_k2hk to transform key vectors into multi-head key vectors \(K^1, \dots, K^{\nHd}\).
- Use self.fc_ff_v2hv to transform value vectors into multi-head value vectors \(V^1, \dots, V^{\nHd}\).
- Use \(Q^1, \dots, Q^{\nHd}\) and \(K^1, \dots, K^{\nHd}\) to calculate similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\).
- Use mask to mask similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\).
- Use softmax to transform similarity scores \(\Sim^1, \dots, \Sim^{\nHd}\) into attention scores \(\attn^1, \dots, \attn^{\nHd}\).
- Use attention scores \(\attn^1, \dots, \attn^{\nHd}\) and \(V^1, \dots, V^{\nHd}\) to calculate hidden features \(F^1, \dots, F^{\nHd}\).
- Use \(W_O\) and hidden features \(F^1, \dots, F^{\nHd}\) to calculate the output \(O\).
- Return \(O\).
- Parameters
k (torch.Tensor) – Batch of sequences of key vectors with shape \((B, S_k, \dMdl)\) and dtype == torch.float.
mask (torch.Tensor) – Batch of attention mask with shape \((B, S_q, S_k)\) and dtype == torch.bool. Set to true to mask attention at the corresponding position.
q (torch.Tensor) – Batch of sequences of query vectors with shape \((B, S_q, \dMdl)\) and dtype == torch.float.
v (torch.Tensor) – Batch of sequences of value vectors with shape \((B, S_k, \dMdl)\) and dtype == torch.float.
- Returns
Batch output features \(O\) with shape \((B, S_q, \dMdl)\) and dtype == torch.float.
- Return type
torch.Tensor
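The procedure above can also be reproduced with plain tensor operations. Below is a minimal, self-contained sketch of scaled dot-product multi-head attention; it follows the algorithm in this section but is not the lmp implementation, and the head-projection layout is an assumption:

    import torch
    import torch.nn as nn

    B, S_q, S_k, d_model, n_head, d_k, d_v = 2, 3, 4, 6, 2, 5, 5  # assumed sizes

    q = torch.rand(B, S_q, d_model)
    k = torch.rand(B, S_k, d_model)
    v = torch.rand(B, S_k, d_model)
    mask = torch.zeros(B, S_q, S_k, dtype=torch.bool)  # True = masked.

    # Project inputs into per-head query / key / value vectors (no biases).
    W_Q = nn.Linear(d_model, n_head * d_k, bias=False)
    W_K = nn.Linear(d_model, n_head * d_k, bias=False)
    W_V = nn.Linear(d_model, n_head * d_v, bias=False)
    W_O = nn.Linear(n_head * d_v, d_model, bias=False)

    Q = W_Q(q).reshape(B, S_q, n_head, d_k).transpose(1, 2)  # (B, n_head, S_q, d_k)
    K = W_K(k).reshape(B, S_k, n_head, d_k).transpose(1, 2)  # (B, n_head, S_k, d_k)
    V = W_V(v).reshape(B, S_k, n_head, d_v).transpose(1, 2)  # (B, n_head, S_k, d_v)

    # Scaled similarity scores, masking, softmax.
    sim = Q @ K.transpose(-2, -1) / d_k ** 0.5               # (B, n_head, S_q, S_k)
    sim = sim.masked_fill(mask.unsqueeze(1), -1e9)
    attn = sim.softmax(dim=-1)

    # Weighted sum of value vectors, flatten heads, project back to d_model.
    F = (attn @ V).transpose(1, 2).reshape(B, S_q, n_head * d_v)
    O = W_O(F)                                               # (B, S_q, d_model)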
- class lmp.model.PosEncLayer(*, d_emb: int = 1, max_seq_len: int = 512, **kwargs: Any)[source]#
Bases:
Module
Positional Encoding [1].
Let \(S\) be the lookup sequence length.
Let \(\dEmb\) be the dimension of positional encodings.
Positional encodings are defined as follows:
\[\begin{split}\begin{align*} & \algoProc{\PosEncLayer}\pa{S} \\ & \indent{1} \algoFor{\pos \in \set{1, \dots, S}} \\ & \indent{2} \algoFor{i \in \set{1, \dots, \dEmb}} \\ & \indent{3} \algoIf{i \text{ is even}} \\ & \indent{4} \PE_{(\pos,i)} \algoEq \sin\pa{\dfrac{\pos}{10000^{i / \dEmb}}} \\ & \indent{3} \algoElse \\ & \indent{4} \PE_{(\pos,i)} \algoEq \cos\pa{\dfrac{\pos}{10000^{i / \dEmb}}} \\ & \indent{3} \algoEndIf \\ & \indent{2} \algoEndFor \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoReturn \PE \\ & \algoEndProc \end{align*}\end{split}\]Trainable Parameters
Nodes
- \(\PE_{(\pos,i)}\): \((1)\)
- \(\PE\): \((1, S, \dEmb)\)
- Parameters
d_emb (int, default: 1) – Dimension of positional encodings \(\dEmb\).
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
max_seq_len (int, default: 512) – Maximum length of the lookup sequence.
- pe#
Positional encoding lookup table.
- Type
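A minimal sketch of the lookup table defined by the formula above; the sizes are arbitrary and the 1-based indices mirror the pseudocode:

    import math
    import torch

    S, d_emb = 4, 8  # assumed lookup length and encoding dimension
    pe = torch.empty(1, S, d_emb)
    for pos in range(1, S + 1):
        for i in range(1, d_emb + 1):
            angle = pos / 10000 ** (i / d_emb)
            # Even feature indices use sine, odd feature indices use cosine.
            pe[0, pos - 1, i - 1] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    # pe has shape (1, S, d_emb) and is typically stored as a non-trainable buffer.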
- class lmp.model.TransEnc(*, d_ff: int = 1, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, label_smoothing: float = 0.0, max_seq_len: int = 512, n_head: int = 1, n_lyr: int = 1, p: float = 0.0, tknzr: BaseTknzr, **kwargs: Any)[source]#
Bases:
BaseModel
Transformer encoder [1] language model.
Let \(x\) be a batch of token ids with batch size \(B\) and sequence length \(S\).
Let \(c\) be the previous batch of token ids (previous context window) with shape \((B, S')\). Note that \(c\) can be empty.
Let \(S_\max\) be the maximum sequence length a model can deal with.
When \(c\) is empty, the constraint \(S \leq S_\max\) must be satisfied.
When \(c\) is not empty, the constraint \(S + S' \leq S_\max\) must be satisfied.
Let \(V\) be the vocabulary size of the paired tokenizer. Each token id represents a unique token, i.e., \(x_t \in \set{1, \dots, V}\).
Let \(E\) be the token embedding lookup table.
Let \(\dMdl\) be the dimension of token embeddings.
Let \(e_t\) be the token embedding corresponding to token id \(x_t\).
Token embeddings have dropout probability \(p\).
Let \(\PE\) be positional encoding layer.
Let \(\PE_t\) be the positional encoding at the \(t\) th position.
The dimension of positional encodings is \(\dMdl\).
Let \(\nLyr\) be the number of transformer encoder layers.
Let \(h^\ell\) be the output of the \(\ell\) th transformer encoder layer.
Transformer encoder language model is defined as follows:
\[\begin{split}\begin{align*} & \algoProc{\TransEnc}\pa{x, c} \\ & \indent{1} \algoIf{c \text{ is not empty}} \\ & \indent{2} x \algoEq \cat{x, c} \\ & \indent{1} \algoEndIf \\ & \indent{1} S \algoEq x.\sz{1} \\ & \indent{1} \algoCmt{Create attention mask.} \\ & \indent{1} \algoFor{i \in \set{1, \dots, S}} \\ & \indent{2} \algoFor{j \in \set{1, \dots, S}} \\ & \indent{3} \algoIf{x_i \algoIs \text{padding}} \\ & \indent{4} \msk_{i,j} \algoEq \algoTrue \\ & \indent{3} \algoElseIf{j \leq i} \\ & \indent{4} \msk_{i,j} \algoEq \algoFalse \\ & \indent{3} \algoElse \\ & \indent{4} \msk_{i,j} \algoEq \algoTrue \\ & \indent{3} \algoEndIf \\ & \indent{2} \algoEndFor \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoCmt{Lookup token embedding and positional encoding.} \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} e_t \algoEq (x_t)\text{-th row of } E \text{ but treated as column vector} \\ & \indent{2} h_t^0 \algoEq \drop{e_t + \PE_t}{p} \\ & \indent{1} \algoEndFor \\ & \indent{1} h^0 \algoEq \cat{h_1^0, \dots, h_S^0} \\ & \indent{1} \algoCmt{Perform forward pass on stacking Transformer encoder layers} \\ & \indent{1} \algoFor{\ell \in \set{1, \dots, \nLyr}} \\ & \indent{2} h^\ell \algoEq \TransEncLayer\pa{ k \algoEq h^{\ell-1}, \msk \algoEq \msk, q \algoEq h^{\ell-1}, v \algoEq h^{\ell-1} } \\ & \indent{1} \algoEndFor \\ & \indent{1} \algoFor{t \in \set{1, \dots, S}} \\ & \indent{2} y_t \algoEq \sof{E \cdot h_t^{\nLyr}} \\ & \indent{1} \algoEndFor \\ & \indent{1} y \algoEq \cat{y_1, \dots, y_S} \\ & \indent{1} c' \algoEq \cat{x_{\max\pa{1, S - (S_\max-2)}}, \dots, x_S} \\ & \indent{1} \algoReturn \pa{y, c'} \\ & \algoEndProc \end{align*}\end{split}\]Trainable Parameters
- \(E\): \((V, \dMdl)\)
- \(\TransEncLayer\)
Nodes
- \(\PE\): \((B, S_\max, \dMdl)\)
- \(\PE_t\): \((B, \dMdl)\)
- \(c\): \((B, S')\)
- \(c'\): \((B, S_\max-1)\)
- \(e_t\): \((B, \dMdl)\)
- \(h^\ell\): \((B, S, \dMdl)\)
- \(h_t^0\): \((B, \dMdl)\)
- \(\msk\): \((B, S, S)\)
- \(\msk_{i,j}\): \((B)\)
- \(x\): \((B, S)\)
- \(x_t\): \((B)\)
- \(y\): \((B, S, V)\)
- \(y_t\): \((B, V)\)
The goal of optimization is to minimize the negative log-likelihood of the next token id \(x_{t+1}\) given \(y_t\). The prediction loss is defined to be the average negative log-likelihood over \(x\) given \(y\).
\[\loss = \dfrac{-1}{S} \sum_{t = 1}^S \log \Pr(x_{t+1} \vert y_t).\]\(y_t\) is the next token id prediction probability distribution over tokenizer’s vocabulary. We use inner product to calculate similarity scores over all token ids, and then use softmax to normalize similarity scores into probability range \([0, 1]\).
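Since the prediction loss is an average negative log-likelihood over the vocabulary, it can be computed with standard cross entropy. A minimal sketch, assuming logits of shape \((B, S, V)\) and next-token targets of shape \((B, S)\):

    import torch
    import torch.nn.functional as F

    B, S, V = 2, 4, 10                        # assumed batch size, sequence length, vocabulary size
    logits = torch.rand(B, S, V)              # similarity scores before softmax
    next_tkids = torch.randint(0, V, (B, S))  # prediction targets x_{t+1}

    # Average negative log-likelihood of the next token id, optionally with label smoothing.
    loss = F.cross_entropy(logits.reshape(-1, V), next_tkids.reshape(-1), label_smoothing=0.1)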
Model parameters in Transformer encoder language model are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) of uniform distribution are given as hyperparameters.
- Parameters
d_ff (int, default: 1) – Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.
d_k (int, default: 1) – Number of key features \(d_k\) in each head.
d_model (int, default: 1) – Number of input / output features \(\dMdl\).
d_v (int, default: 1) – Number of value features \(d_v\) in each head.
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
label_smoothing (float, default: 0.0) – Smoothing applied on prediction target \(x_{t+1}\).
max_seq_len (int, default: 512) – Maximum length of the input sequence.
n_lyr (int, default: 1) – Number of Transformer encoder layers \(\nLyr\).
n_head (int, default: 1) – Number of attention heads \(\nHd\).
p (float, default: 0.0) – Dropout probability \(p\).
tknzr (BaseTknzr) – Tokenizer instance.
- emb#
Token embedding lookup matrix. Use token ids to lookup token embeddings.
- Type
- init_lower#
Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
- Type
float
- init_upper#
Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
- Type
float
- input_dp#
Dropout with probability \(p\) applied on the sum of token embeddings and position encodings.
- Type
- loss_fn#
Loss function to be optimized.
- pos_enc#
Positional Encoding.
- stack_trans_enc#
TransEncLayer stacking layers. The number of stacked layers is equal to \(\nLyr\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
- classmethod add_CLI_args(parser: ArgumentParser) None [source]#
Add transformer encoder language model hyperparameters to CLI arguments parser.
- Parameters
parser (argparse.ArgumentParser) – CLI argument parser.
- Return type
None
See also
- lmp.script.train_model
Language model training script.
Examples
>>> import argparse
>>> import math
>>> from lmp.model import TransEnc
>>> parser = argparse.ArgumentParser()
>>> TransEnc.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--d_ff', '2',
...   '--d_k', '4',
...   '--d_model', '6',
...   '--d_v', '8',
...   '--init_lower', '-0.01',
...   '--init_upper', '0.01',
...   '--label_smoothing', '0.1',
...   '--n_head', '10',
...   '--n_lyr', '2',
...   '--p', '0.1',
... ])
>>> assert args.d_ff == 2
>>> assert args.d_k == 4
>>> assert args.d_model == 6
>>> assert args.d_v == 8
>>> assert math.isclose(args.init_lower, -0.01)
>>> assert math.isclose(args.init_upper, 0.01)
>>> assert math.isclose(args.label_smoothing, 0.1)
>>> assert args.n_head == 10
>>> assert args.n_lyr == 2
>>> assert math.isclose(args.p, 0.1)
- cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor] [source]#
Calculate language model prediction loss.
We use cross entropy loss as our training objective. This method is only used for training.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.
- Returns
The first tensor in the tuple is the mini-batch cross-entropy loss. The loss tensor has shape \((1)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in the forward pass (denoted \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.
- Return type
Tuple[torch.Tensor, torch.Tensor]
- create_mask(batch_tkids: Tensor) Tensor [source]#
Create self-attention mask for batch_tkids.
The self-attention mask is created as follows:
- Create an auto-regressive mask by masking everything above the diagonal. This is needed since the input token at each time step can only see input tokens at previous time steps and itself.
- Create padding masks by masking every position corresponding to padding tokens. This is needed since paddings are meaningless.
- Parameters
batch_tkids (torch.Tensor) – Batch of token ids with shape (B, S) and dtype == torch.long.
- Returns
Auto-regressive self-attention mask combined with the padding mask. The returned tensor has shape (B, S, S) and dtype == torch.bool.
- Return type
torch.Tensor
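A minimal sketch of such a combined mask (the padding token id and the exact broadcasting convention are assumptions, not necessarily the internals of create_mask):

    import torch

    B, S, pad_tkid = 2, 5, 0                  # assumed batch size, sequence length, padding id
    batch_tkids = torch.randint(0, 10, (B, S))

    # Auto-regressive mask: True strictly above the diagonal, so future positions are masked.
    causal_mask = torch.triu(torch.ones(S, S), diagonal=1).bool()   # (S, S)

    # Padding mask: True wherever the token id is the padding id (applied along the key dimension here).
    pad_mask = (batch_tkids == pad_tkid).unsqueeze(1)               # (B, 1, S)

    mask = causal_mask.unsqueeze(0) | pad_mask                      # (B, S, S), dtype == torch.bool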
- forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor] [source]#
Calculate next token id logits.
Logits are calculated based on previous hidden states and current input token ids. Use pred to convert logits into the next token id probability distribution over the tokenizer's vocabulary. Use cal_loss to convert logits into the next token id prediction loss. Below we describe the forward pass algorithm of the Transformer encoder language model.
- Use token ids to look up token embeddings with self.emb.
- Use the sequence length to look up positional encodings with self.pos_enc.
- Apply dropout to the sum of token embeddings and positional encodings.
- Feed the result into the first transformer encoder layer. We use teacher forcing in this step when performing training, i.e., inputs are directly given instead of generated by the model.
- Feed the output of the previous transformer encoder layer into the next transformer encoder layer until all layers have been used once.
- Perform an inner product between the output of the last transformer encoder layer and the token embeddings to get similarity scores.
- Return similarity scores (logits).
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.
- Returns
The first tensor in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in the forward pass (denoted \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.
- Return type
Tuple[torch.Tensor, torch.Tensor]
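A rough sketch of the steps above using plain PyTorch modules; the sizes are hypothetical, the stacked encoder layers are represented only as comments, and tying the embedding matrix to the output projection follows the algorithm definition:

    import torch
    import torch.nn as nn

    B, S, V, d_model, p = 2, 5, 10, 8, 0.1   # assumed sizes
    emb = nn.Embedding(V, d_model)
    input_dp = nn.Dropout(p)
    pe = torch.rand(1, 512, d_model)         # stands in for the positional-encoding lookup table

    x = torch.randint(0, V, (B, S))
    h = input_dp(emb(x) + pe[:, :S, :])      # token embeddings + positional encodings, then dropout

    # Each encoder layer consumes the previous layer's output as query, key and value:
    # for layer in stack_trans_enc:
    #     h = layer(mask=mask, x=h)

    # Inner product with the embedding matrix yields next token id logits.
    logits = h @ emb.weight.transpose(0, 1)  # (B, S, V)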
- params_init() None [source]#
Initialize model parameters.
All weights and biases are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\).
- Return type
None
See also
params_init
Transformer encoder layer parameter initialization.
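A minimal sketch of this initialization scheme:

    import torch.nn as nn

    def uniform_params_init(module: nn.Module, init_lower: float = -0.1, init_upper: float = 0.1) -> None:
        # Every weight and bias is drawn from U(init_lower, init_upper).
        for param in module.parameters():
            nn.init.uniform_(param, init_lower, init_upper)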
- pred(batch_cur_tkids: Tensor, batch_prev_states: Optional[Tensor] = None) Tuple[Tensor, Tensor] [source]#
Calculate next token id probability distribution over tokenizer’s vocabulary.
Probabilities are calculated based on previous hidden states and current input token ids. This method must only be used for inference. No tensor graphs will be constructed and no gradients will be calculated.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Optional[torch.Tensor], default: None) – Batch of previous token ids \(c\). The tensor represents the batch of token ids used in the previous context. It has shape \((B, S')\) and dtype == torch.long. If given, it will be concatenated with batch_cur_tkids. Set to None to do nothing.
- Returns
The first tensor in the tuple is the batch of next token id probability distributions over the paired tokenizer's vocabulary. The probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second tensor in the tuple is a batch of the token ids used in the forward pass (denoted \(c'\) in our definition). The second tensor has shape \((B, \min(S, S_\max-1))\) and dtype == torch.long.
- Return type
Tuple[torch.Tensor, torch.Tensor]
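Conceptually, pred applies softmax to the logits returned by forward, with gradient tracking disabled. A minimal sketch (the model instance is assumed to exist):

    import torch
    from lmp.model import TransEnc

    def pred_sketch(model: TransEnc, batch_cur_tkids: torch.Tensor) -> torch.Tensor:
        # Inference only: no tensor graph is constructed and no gradients are calculated.
        with torch.no_grad():
            logits, _ = model(batch_cur_tkids=batch_cur_tkids, batch_prev_states=None)
        # Softmax normalizes similarity scores into a probability distribution over the vocabulary.
        return logits.softmax(dim=-1)  # (B, S, V)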
- class lmp.model.TransEncLayer(*, d_ff: int = 1, d_k: int = 1, d_model: int = 1, d_v: int = 1, init_lower: float = -0.1, init_upper: float = 0.1, n_head: int = 1, p: float = 0.0, **kwargs: Any)[source]#
Bases:
Module
Transformer encoder layer [1].
Let \(B\) be mini-batch size.
Let \(S\) be the length of each sequence in a mini-batch.
Let \(\dMdl\) be the number of features per time step in each sequence.
Let \(x\) be a batch of sequences of features with shape \((B, S, \dMdl)\).
Let \(\msk\) be a batch of attention mask with shape \((B, S, S)\).
Let \(\nHd\) be the number of attention heads.
Let \(d_k\) be the number of key features in each attention head.
Let \(d_v\) be the number of value features in each attention head.
Let \(\dFf\) be the number of hidden units in the 2-layer fully connected feed-forward network.
Let \(p\) be the dropout probability.
Transformer encoder layer is defined as follows:
\[\begin{split}\begin{align*} & \algoProc{\TransEncLayer}(\msk, x) \\ & \indent{1} y_1 \algoEq \MultiHeadAttnLayer\pa{k \algoEq x, \msk \algoEq \msk, q \algoEq x, v \algoEq x} \\ & \indent{1} y_2 \algoEq \LayerNorm_1\pa{x + \drop{y_1}{p}} \\ & \indent{1} y_3 \algoEq W_2 \cdot \max\pa{\mathbf{0}, W_1 \cdot y_2 + b_1} + b_2 \\ & \indent{1} y_4 \algoEq \LayerNorm_2\pa{y_2 + \drop{y_3}{p}} \\ & \indent{1} \algoReturn y_4 \\ & \algoEndProc \end{align*}\end{split}\]Trainable Parameters
- \(W_1\): \((\dFf, \dMdl)\)
- \(W_2\): \((\dMdl, \dFf)\)
- \(b_1\): \((\dFf)\)
- \(b_2\): \((\dMdl)\)
- \(\MultiHeadAttnLayer\)
- \(\LayerNorm_1\)
- \(\LayerNorm_2\)
Nodes
- \(\mathbf{0}\): \((B, S, \dFf)\)
- \(\msk\): \((B, S, S)\)
- \(x\): \((B, S, \dMdl)\)
- \(y_1\): \((B, S, \dMdl)\)
- \(y_2\): \((B, S, \dMdl)\)
- \(y_3\): \((B, S, \dMdl)\)
- \(y_4\): \((B, S, \dMdl)\)
Model parameters in Transformer encoder layer are initialized with uniform distribution \(\mathcal{U}(\init_l, \init_u)\). The lower bound \(\init_l\) and upper bound \(\init_u\) are given as hyperparameters.
- Parameters
d_ff (int, default: 1) – Number of hidden units \(\dFf\) in the 2-layer fully connected feed-forward network.
d_k (int, default: 1) – Number of key features \(d_k\) in each head.
d_model (int, default: 1) – Number of input / output features \(\dMdl\).
d_v (int, default: 1) – Number of value features \(d_v\) in each head.
init_lower (float, default: -0.1) – Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
init_upper (float, default: 0.1) – Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
n_head (int, default: 1) – Number of attention heads \(\nHd\).
p (float, default: 0.0) – Dropout probability \(p\).
- ffn#
2-layer fully connected feed-forward network with parameters \(W_1, W_2, b_1, b_2\). Dropout with probability \(p\) is applied to output. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
- init_lower#
Uniform distribution lower bound \(\init_l\) used to initialize model parameters.
- Type
float
- init_upper#
Uniform distribution upper bound \(\init_u\) used to initialize model parameters.
- Type
float
- ln_1#
Corresponds to \(\LayerNorm_1\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
- ln_2#
Corresponds to \(\LayerNorm_2\). Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
- mha#
Multi-head self attention layer. Multi-head attention is calculated through \(\MultiHeadAttnLayer\) and self-attention is achieved by giving identical input to query, key and value. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
MultiHeadAttnLayer
- mha_dp#
Perform dropout with probability \(p\) on the output of multi-head self attention. Input shape: \((B, S, \dMdl)\). Output shape: \((B, S, \dMdl)\).
- Type
See also
MultiHeadAttnLayer
Multi-head attention layer.
- forward(mask: Tensor, x: Tensor) Tensor [source]#
Calculate a batch of hidden states for x.
Below we describe the forward pass algorithm of the transformer encoder layer.
- Let x be a batch of sequences of features \(x\).
- Let mask be a batch of attention masks \(\msk\).
- Use self.mha to perform multi-head self attention on x and get \(y_1\).
- Use self.mha_dp to perform dropout on \(y_1\).
- Add \(x\) and \(y_1\) (with dropout applied) and use self.ln_1 to perform layer normalization on the addition result to get \(y_2\).
- Use self.ffn to perform the 2-layer fully connected feed-forward network forward pass and get \(y_3\).
- Add \(y_2\) and \(y_3\) (with dropout applied) and use self.ln_2 to perform layer normalization on the addition result to get \(y_4\).
- Return \(y_4\).
- Parameters
x (torch.Tensor) – Batch of sequences of features with shape \((B, S, \dMdl)\) and dtype == torch.float.
mask (torch.Tensor) – Batch of attention mask with shape \((B, S, S)\) and dtype == torch.bool. Set to true to mask attention at the corresponding position.
- Returns
Batch of sequences of output features \(y_4\) with shape \((B, S, \dMdl)\) and dtype == torch.float.
- Return type
torch.Tensor
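The residual and layer-normalization structure above can be spelled out with standard modules. A minimal sketch with assumed sizes; the attention output is a stand-in for self.mha, and dropout on the feed-forward output is folded into ffn:

    import torch
    import torch.nn as nn

    B, S, d_model, d_ff, p = 2, 5, 8, 16, 0.1  # assumed sizes
    x = torch.rand(B, S, d_model)
    y_1 = torch.rand(B, S, d_model)            # stands in for self.mha(k=x, mask=mask, q=x, v=x)

    ln_1, ln_2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
    dp = nn.Dropout(p)
    ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model), nn.Dropout(p))

    y_2 = ln_1(x + dp(y_1))  # add & norm around multi-head self attention
    y_3 = ffn(y_2)           # 2-layer position-wise feed-forward network (dropout applied at the end)
    y_4 = ln_2(y_2 + y_3)    # add & norm around the feed-forward network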
- params_init() None [source]#
Initialize model parameters.
All weights and biases are initialized with uniform distribution \(\mathcal{U}\pa{\init_l, \init_u}\).
- Return type
None
See also
params_init
Multi-head attention layer parameter initialization.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.