Language model base class#

class lmp.model.BaseModel(**kwargs: Any)[source]#

Bases: ABC, Module

Language model abstract base class.

Implements basic functionalities of a language model, including training loss calculation, next token id prediction, and training argument parsing.

Let \(X = \set{x^1, x^2, \dots, x^B}\) be a mini-batch of token id lists with batch size \(B\). A token id list \(x \in X\) is defined as follows:

\[x = \pa{x_1, x_2, \dots, x_S, x_{S+1}}.\]
  • \(x\) has length \(S+1\).

  • \(x_t\) is the \(t\)-th time step of \(x\), the range of \(t\) is \(\set{1, \dots, S+1}\).

  • \(x_1\) is the token id of <bos>.

  • \(x_{S+1}\) is the token id of <eos>.

  • Each language model will be paired with one tokenizer. Let \(V\) be the size of the paired tokenizer’s vocabulary. Then \(x_t \in \set{1, \dots, V}\).
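
For example, with a hypothetical tokenizer whose <bos> and <eos> token ids are 0 and 1, a mini-batch with \(B = 2\) and \(S = 3\) could look like the sketch below (the actual ids depend on the paired tokenizer's vocabulary):

```python
# Hypothetical ids: <bos> = 0, <eos> = 1; the remaining ids index the tokenizer's vocabulary.
# Each token id list has length S + 1 = 4: x_1 is <bos> and x_{S+1} is <eos>.
X = [
    [0, 5, 9, 1],  # x^1
    [0, 7, 3, 1],  # x^2
]
```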

The training goal of a language model with parameter \(\theta\) is to find an optimal parameter \(\theta^{\star}\) such that replacing \(\theta\) with \(\theta^{\star}\) maximizes the prediction probability of the next token id \(x_{t+1}\) given \(x_1, \dots, x_t\):

\[\theta^{\star} = \arg\max_{\theta} \prod_{x \in X} \prod_{t = 1}^S P(x_{t+1} \vert x_1, \dots, x_t ; \theta)\]

Note that all token id lists in \(X\) have the same length \(S+1\) and \(t\) starts at \(1\). Thus for each token id list \(x \in X\), the first \(S\) token ids serve as input and the last \(S\) token ids serve as prediction targets. Only \(S\) positions contribute to the loss.
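
In practice this maximization is typically carried out by minimizing the cross entropy between the predicted next token id distribution and the targets at each of the \(S\) positions. A minimal sketch of that equivalence with hypothetical shapes, not the library's actual training code:

```python
import torch
import torch.nn.functional as F

B, S, V = 2, 3, 10                     # batch size, sequence length, vocabulary size
logits = torch.randn(B, S, V)          # hypothetical next token id logits for positions 1..S
targets = torch.randint(0, V, (B, S))  # prediction targets x_2, ..., x_{S+1}

# Maximizing prod_t P(x_{t+1} | x_1, ..., x_t; theta) is equivalent to minimizing
# the average negative log-likelihood over all B * S positions.
loss = F.cross_entropy(logits.reshape(B * S, V), targets.reshape(B * S))
```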

Parameters

kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

model_name#

CLI name of the model. Only used to parse CLI arguments.

Type

ClassVar[str]

abstract classmethod add_CLI_args(parser: ArgumentParser) → None[source]#

Add language model hyperparameters to CLI argument parser.

Parameters

parser (argparse.ArgumentParser) – CLI argument parser.

Return type

None

See also

lmp.script.train_model

Language model training script.
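
A minimal sketch of how a concrete subclass might implement this hook; the class name and the `--d_emb` / `--n_layer` flags are hypothetical and only illustrate the pattern:

```python
import argparse

from lmp.model import BaseModel


class MyModel(BaseModel):
    # Other abstract methods are omitted from this sketch.
    model_name = 'my-model'

    @classmethod
    def add_CLI_args(cls, parser: argparse.ArgumentParser) -> None:
        # Group hyperparameters so that `--help` output stays readable.
        group = parser.add_argument_group('MyModel hyperparameters')
        group.add_argument('--d_emb', type=int, default=100, help='Embedding dimension.')
        group.add_argument('--n_layer', type=int, default=2, help='Number of layers.')
```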

abstract cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Any] = None) → Tuple[Tensor, Any][source]#

Calculate language model prediction loss.

Loss is defined as the difference between the model's predicted next token id distribution and the ground truth next token ids. Predicting the next token is treated as a classification problem, where the number of classes equals the tokenizer's vocabulary size. This method is only used for training.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.

Returns

The first item in the tuple is the mini-batch loss with shape \((1)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.

Return type

tuple[torch.Tensor, Any]
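
A hedged sketch of a single training step built on cal_loss, assuming `model` is an instance of some concrete subclass and `x` is a mini-batch of token id lists with shape \((B, S + 1)\) and dtype torch.long:

```python
import torch


def train_step(model, optimizer: torch.optim.Optimizer, x: torch.Tensor) -> float:
    # The first S token ids serve as input; the last S token ids serve as prediction targets.
    batch_cur_tkids = x[:, :-1]
    batch_next_tkids = x[:, 1:]

    optimizer.zero_grad()
    loss, _ = model.cal_loss(
        batch_cur_tkids=batch_cur_tkids,
        batch_next_tkids=batch_next_tkids,
        batch_prev_states=None,  # None selects the initial hidden states
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```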

abstract forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Any] = None) → Tuple[Tensor, Any][source]#

Calculate next token id logits.

Logits are calculated based on previous hidden states and the current input token ids. Use pred to convert logits into a next token id probability distribution over the tokenizer's vocabulary. Use cal_loss to convert logits into the next token id prediction loss.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.

Returns

The first item in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.

Return type

tuple[torch.Tensor, Any]

See also

enc

Source of token ids.
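
A short sketch relating forward to pred, assuming `model` is an instance of some hypothetical concrete subclass; a softmax over the last dimension is one standard way to turn the logits into the per-position distribution over the vocabulary, which is conceptually what pred returns:

```python
import torch


def next_token_distribution(model, batch_cur_tkids: torch.Tensor) -> torch.Tensor:
    # batch_cur_tkids has shape (B, S) and dtype torch.long.
    logits, batch_cur_states = model(batch_cur_tkids=batch_cur_tkids, batch_prev_states=None)
    # logits has shape (B, S, V); softmax over V yields a probability distribution
    # over the tokenizer's vocabulary at each position.
    return torch.softmax(logits, dim=-1)
```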

abstract params_init() → None[source]#

Initialize model parameters.

The ways and values used to initialize a model are considered hyperparameters. Different models may have different initialization schemes.

Return type

None
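
A minimal sketch of one possible params_init override; the uniform range is hypothetical, and concrete subclasses in the library may use entirely different schemes:

```python
import torch

from lmp.model import BaseModel


class MyModel(BaseModel):
    # Other abstract methods are omitted from this sketch.

    def params_init(self) -> None:
        # Hypothetical scheme: initialize every parameter uniformly in [-0.1, 0.1].
        with torch.no_grad():
            for param in self.parameters():
                param.uniform_(-0.1, 0.1)
```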

abstract pred(batch_cur_tkids: Tensor, batch_prev_states: Any = None) → Tuple[Tensor, Any][source]#

Calculate next token id probability distribution over tokenizer’s vocabulary.

Probability distribution is calculated based on previous hidden states and the current input token ids. This method is only used for inference; for training, use cal_loss instead. No computational graphs are constructed and no gradients are calculated.

Parameters
  • batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.

  • batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.

Returns

The first item in the tuple is the batch of next token id probability distributions over the tokenizer’s vocabulary. The probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.

Return type

tuple[torch.Tensor, Any]
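
A hedged sketch of greedy autoregressive decoding built on pred, assuming `model` is an instance of some concrete subclass and `bos_id` / `eos_id` come from the paired tokenizer; hidden states are carried across calls so each step only feeds the newest token id:

```python
import torch


def greedy_decode(model, bos_id: int, eos_id: int, max_len: int = 32) -> list:
    batch_cur_tkids = torch.tensor([[bos_id]], dtype=torch.long)  # shape (B, S) = (1, 1)
    batch_prev_states = None                                      # start from initial hidden states
    out = [bos_id]

    for _ in range(max_len):
        probs, batch_prev_states = model.pred(
            batch_cur_tkids=batch_cur_tkids,
            batch_prev_states=batch_prev_states,
        )
        # probs has shape (B, S, V); take the distribution at the last position.
        next_tkid = int(probs[0, -1].argmax())
        out.append(next_tkid)
        if next_tkid == eos_id:
            break
        batch_cur_tkids = torch.tensor([[next_tkid]], dtype=torch.long)

    return out
```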