lmp.model._base
Language model base class.
- class lmp.model._base.BaseModel(**kwargs: Any)[source]#
Language model abstract base class.
Implements basic language model functionality, including training loss calculation, next token id prediction, and training argument parsing.
Let \(X = \{x^1, x^2, \dots, x^B\}\) be a mini-batch of token id lists with batch size \(B\). A token id list \(x \in X\) is defined as follows:
\[x = (x_1, x_2, \dots, x_S, x_{S+1}).\]
\(x\) has length \(S+1\).
\(x_t\) is the \(t\)-th time step of \(x\); the range of \(t\) is \(\{1, \dots, S+1\}\).
\(x_1\) is the token id of <bos>. \(x_{S+1}\) is the token id of <eos>.
Each language model is paired with one tokenizer. Let \(V\) be the size of the paired tokenizer’s vocabulary. Then \(x_t \in \{1, \dots, V\}\).
The training goal of a language model with parameter \(\theta\) is to find an optimal parameter \(\theta^{\star}\) which maximizes the probability of predicting the next token id \(x_{t+1}\) given \(x_1, \dots, x_t\):
\[\theta^{\star} = \arg\max_{\theta} \prod_{x \in X} \prod_{t = 1}^S P(x_{t+1} \vert x_1, \dots, x_t ; \theta)\]
Note that all token id lists in \(X\) have the same length \(S+1\) and \(t\) starts at \(1\). Thus for each token id list \(x \in X\), the first \(S\) token ids serve as input and the last \(S\) token ids serve as prediction targets; only \(S\) positions contribute to the loss.
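The input/target split described above can be sketched as a pair of tensor slices (a minimal sketch with hypothetical values \(B = 2\), \(S = 4\), \(V = 100\)):

```python
import torch

# A hypothetical (B, S+1) mini-batch of token id lists: the first S token
# ids of each list are the input, the last S are the prediction target.
batch_tkids = torch.randint(0, 100, (2, 5))
batch_cur_tkids = batch_tkids[:, :-1]   # x_1 .. x_S, shape (B, S)
batch_next_tkids = batch_tkids[:, 1:]   # x_2 .. x_{S+1}, shape (B, S)
```

Position \(t\) of the input tensor is predicted by position \(t-1\)'s output, so the two slices overlap in \(S-1\) positions.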
- Parameters
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
- abstract classmethod add_CLI_args(parser: ArgumentParser) → None [source]#
Add language model hyperparameters to CLI argument parser.
- Parameters
parser (argparse.ArgumentParser) – CLI argument parser.
- Return type
None
See also
- lmp.script.train_model
Language model training script.
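A subclass's add_CLI_args override might look like the following sketch. The hyperparameter names --d_emb and --p_emb are hypothetical, not taken from this page:

```python
import argparse

# Hypothetical override of add_CLI_args: register model hyperparameters
# on the CLI parser, following the (parser) -> None contract above.
def add_CLI_args(parser: argparse.ArgumentParser) -> None:
  group = parser.add_argument_group('model hyperparameters')
  group.add_argument('--d_emb', type=int, default=32, help='Embedding dimension.')
  group.add_argument('--p_emb', type=float, default=0.1, help='Embedding dropout probability.')

parser = argparse.ArgumentParser()
add_CLI_args(parser)
args = parser.parse_args(['--d_emb', '64'])
```

The training script can then forward the parsed values to the model constructor as keyword arguments.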
- abstract cal_loss(batch_cur_tkids: Tensor, batch_next_tkids: Tensor, batch_prev_states: Optional[Any] = None) → Tuple[Tensor, Any] [source]#
Calculate language model prediction loss.
Loss is defined as the difference between the model's next token id distribution and the answer. Predicting the next token is treated as a classification problem, where the number of classes equals the tokenizer’s vocabulary size. This method is only used for training.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_next_tkids (torch.Tensor) – Prediction target of each sample in the batch. batch_next_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.
- Returns
The first item in the tuple is the mini-batch loss with shape \((1)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.
- Return type
tuple[torch.Tensor, Any]
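Since next token prediction is treated as \(V\)-way classification, a concrete subclass would typically reduce the \(B \times S\) positions with cross entropy. A minimal sketch with hypothetical values \(B = 2\), \(S = 3\), \(V = 5\) (the real implementation is model-specific):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits as a model would produce for x_1 .. x_S, and the
# matching prediction targets x_2 .. x_{S+1}.
B, S, V = 2, 3, 5
logits = torch.randn(B, S, V)
batch_next_tkids = torch.randint(0, V, (B, S))

# Average cross entropy over all B * S positions gives the scalar
# mini-batch loss that cal_loss is specified to return.
loss = F.cross_entropy(logits.reshape(-1, V), batch_next_tkids.reshape(-1))
```

Cross entropy here is exactly the averaged negative log-likelihood of the training objective defined at the top of this page.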
- abstract forward(batch_cur_tkids: Tensor, batch_prev_states: Optional[Any] = None) → Tuple[Tensor, Any] [source]#
Calculate next token id logits.
Logits are calculated based on previous hidden states and the current input token ids. Use pred to convert logits into the next token id probability distribution over the tokenizer’s vocabulary. Use cal_loss to convert logits into the next token id prediction loss.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.
- Returns
The first item in the tuple is the batch of next token id logits with shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.
- Return type
tuple[torch.Tensor, Any]
See also
- enc
Source of token ids.
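The relation between forward and pred amounts to a softmax over the vocabulary dimension. A sketch with hypothetical values \(B = 2\), \(S = 3\), \(V = 5\):

```python
import torch

# Hypothetical logits with the (B, S, V) shape forward is specified
# to return.
B, S, V = 2, 3, 5
logits = torch.randn(B, S, V)

# pred corresponds to normalizing the logits over the vocabulary
# dimension so each position holds a probability distribution.
probs = torch.softmax(logits, dim=-1)
```

Each of the \(B \times S\) positions now sums to one over the \(V\) vocabulary entries.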
- abstract params_init() → None [source]#
Initialize model parameters.
The ways and values used to initialize a model are considered hyperparameters. Different models may have different initialization schemes.
- Return type
None
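One possible initialization scheme a subclass might implement, uniform init of every parameter (the scheme and the ±0.1 range are hypothetical, not prescribed by this page):

```python
import torch
import torch.nn as nn

# A stand-in module; a real subclass would iterate over its own parameters.
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10))

# Hypothetical params_init body: re-draw every parameter uniformly
# in [-0.1, 0.1] without recording gradients.
with torch.no_grad():
  for p in model.parameters():
    p.uniform_(-0.1, 0.1)
```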
- abstract pred(batch_cur_tkids: Tensor, batch_prev_states: Any = None) → Tuple[Tensor, Any] [source]#
Calculate next token id probability distribution over tokenizer’s vocabulary.
Probability distribution is calculated based on previous hidden states and the current input token ids. This method is only used for inference. For training, use cal_loss instead. No tensor graphs are constructed and no gradients are calculated.
- Parameters
batch_cur_tkids (torch.Tensor) – Batch of current input token ids. batch_cur_tkids has shape \((B, S)\) and dtype == torch.long.
batch_prev_states (Any, default: None) – Batch of previous calculation results. Set to None to use initial hidden states. Different models may have different hidden state structures.
- Returns
The first item in the tuple is the batch of next token id probability distributions over the tokenizer’s vocabulary. The probability tensor has shape \((B, S, V)\) and dtype == torch.float. The second item in the tuple represents the current hidden states. Different models may have different hidden state structures.
- Return type
tuple[torch.Tensor, Any]
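Carrying batch_prev_states across calls lets pred drive step-by-step generation. A sketch of that loop, where pred_fn stands in for a model's pred (greedy decoding, start token id 0, and the tiny vocabulary are all hypothetical):

```python
import torch

V = 5  # hypothetical vocabulary size

# Stand-in for model.pred: returns a (B, 1, V) probability tensor plus
# the hidden states to feed back in on the next call.
def pred_fn(batch_cur_tkids, batch_prev_states=None):
  probs = torch.softmax(torch.randn(batch_cur_tkids.size(0), 1, V), dim=-1)
  return probs, batch_prev_states  # states stay opaque to the caller

batch_cur_tkids = torch.zeros(1, 1, dtype=torch.long)  # hypothetical <bos> id 0
batch_prev_states = None
out = []
with torch.no_grad():  # pred builds no graph; mirror that at the call site
  for _ in range(3):
    probs, batch_prev_states = pred_fn(batch_cur_tkids, batch_prev_states)
    # Greedily pick the most probable id at the last position and feed it back.
    batch_cur_tkids = probs[:, -1].argmax(dim=-1, keepdim=True)
    out.append(int(batch_cur_tkids))
```

Only the newest token id is fed back each step; the hidden states carry the rest of the history.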