

An efficient way to calculate the gradient of a loss function with respect to each model parameter. See [1] for algorithmic details.

batch size#

Number of samples in a mini-batch.

BOS token#
BOS tokens#
begin-of-sequence token#
begin-of-sequence tokens#

The BOS token is a special token which represents the beginning of a sequence. In this project, the BOS token represents the beginning of a given text passage. The BOS token is the first input of a language model. A language model is trained so that, when it receives a BOS token, it predicts the most probable token to appear at the start of a text passage.


When training language models, we save model parameters for later evaluation. Model parameters are saved once every fixed number of steps. A step number that triggers the saving process is called a checkpoint. All checkpoints are saved under your experiment path and named with the format model-\d+, where \d+ stands for the checkpoint step.

context window#
context window size#
context windows#

When performing time-series tasks, one sometimes needs to deal with sequences that are too long to fit into memory as a whole. In this case we usually chunk the time-series data into subsequences (or frames) and train the model on these subsequences. Such a subsequence is called a context window, and the size of each subsequence is called the context window size. When optimizing RNN models, one usually optimizes performance on subsequences instead of the whole sequence to ease optimization problems like gradient explosion and gradient vanishing. Context windows can overlap. The number of overlapping tokens is called stride. Both context window size and stride are treated as preprocessing hyperparameters.
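The chunking described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name split_into_windows and the parameter name overlap are hypothetical, and the project's actual preprocessing code may handle the final window differently.

```python
def split_into_windows(tokens, window_size, overlap):
    """Split a sequence into context windows that share `overlap` tokens."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    # Each new window starts `window_size - overlap` tokens after the previous one.
    step = window_size - overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        # Stop once a window has reached the end of the sequence.
        if start + window_size >= len(tokens):
            break
        start += step
    return windows


windows = split_into_windows("hello world", window_size=4, overlap=2)
print(windows)  # ['hell', 'llo ', 'o wo', 'worl', 'rld']
```

With a context window size of 4 and 2 overlapping tokens, the text hello world yields five windows under this reading; the last window is shorter because the text runs out.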

cross entropy#
cross entropy loss#
cross-entropy loss#

A loss function used to optimize classifiers. Suppose that we are performing a \(C\)-class classification task and a classifier produces a probability distribution \(P(x) = \pa{P_1(x), \dots, P_C(x)}\) given an input \(x\). If the ground truth corresponding to \(x\) is \(y\) (note that \(y \in \set{1, \dots, C}\)), then the cross entropy loss of \((x, y)\) is calculated as follows:

\[\operatorname{CE}(x, y) = -\log P_y(x).\]

Since the probabilities sum to \(1\), \(P_y(x) \approx 1\) implies \(P_i(x) \approx 0\) for every other class \(i \neq y\). Thus if one uses cross entropy to optimize a model, then one is maximizing the model's log-likelihood of the ground truth.
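As a small numeric illustration of the formula above, the following sketch computes \(-\log P_y(x)\) with the natural logarithm. The function name cross_entropy and the example distributions are our own choices for illustration only.

```python
import math


def cross_entropy(probs, y):
    """Return -log P_y(x), where probs = (P_1(x), ..., P_C(x)) and y is 1-based."""
    return -math.log(probs[y - 1])


# A confident correct prediction gives a loss close to 0.
confident = cross_entropy([0.05, 0.9, 0.05], y=2)
# An unsure prediction for the same ground truth gives a larger loss.
unsure = cross_entropy([0.4, 0.3, 0.3], y=2)
print(confident, unsure)
```

The closer \(P_y(x)\) gets to \(1\), the closer the loss gets to its lower bound \(0\).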


A parallel computing platform and programming library for GPUs, developed by NVIDIA.


In our project, a dataset consists of text samples.

See also


All available datasets.


Detokenization is the opposite operation of tokenization; it converts a token list back into text.

For example, when we use a character tokenizer to detokenize ['a', 'b', 'c'] we get 'abc'; when we use a whitespace tokenizer to detokenize ['a', 'b', 'c'] we get 'a b c'.

EOS token#
EOS tokens#
end-of-sequence token#
end-of-sequence tokens#

The EOS token is a special token which represents the end of a sequence. In this project, the EOS token represents the end of a given text passage. The EOS token is the prediction target of the last input token of a language model. In this project, any tokens that follow an EOS token can only be PAD tokens, and language models are not trained to produce meaningful output when seeing EOS tokens or PAD tokens.
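The roles of the BOS and EOS tokens can be illustrated with a short sketch. The token strings "<bos>" and "<eos>" and the helper make_input_and_target are hypothetical names chosen for this example, not necessarily the ones used by the project's tokenizers.

```python
BOS = "<bos>"  # hypothetical string for the begin-of-sequence token
EOS = "<eos>"  # hypothetical string for the end-of-sequence token


def make_input_and_target(tokens):
    """Wrap tokens with BOS/EOS and shift by one to get next-token targets."""
    seq = [BOS] + list(tokens) + [EOS]
    # The model reads everything except EOS and predicts everything except BOS.
    return seq[:-1], seq[1:]


inputs, targets = make_input_and_target(["h", "i"])
print(inputs)   # ['<bos>', 'h', 'i']
print(targets)  # ['h', 'i', '<eos>']
```

In this sketch the BOS token is the first input, and the EOS token is the prediction target of the last input token.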


One full pass over every sample in a dataset is called an epoch.


May refer to a tokenizer training experiment or a language model training experiment. One usually trains a tokenizer first and then trains a language model.

experiment name#

Name of a particular experiment.

experiment path#

All experiment files are stored under the directory project_root/exp. If the experiment name is my_exp, then the experiment path is project_root/exp/my_exp.

forward pass#

The process in which a model takes an input tensor and performs calculations with its parameters to achieve a certain goal is called a forward pass. In the PyTorch framework this corresponds to the forward method of torch.nn.Module.

gradient descent#

If we have a loss function \(L\), then the direction in which \(L\) increases fastest with respect to a model parameter \(W\) is \(\nabla_W L\), the gradient of \(L\) with respect to \(W\). Thus to minimize \(L\), one has to move along the opposite (negative) direction of the gradient \(\nabla_W L\):

\[W_{\operatorname{new}} = W_{\operatorname{old}} - \eta \nabla_{W_{\operatorname{old}}} L.\]

Here \(\eta\) is the learning rate. When \(\eta\) is small enough, we expect to have \(L(W_{\operatorname{new}}) \leq L(W_{\operatorname{old}})\). To perform gradient descent, a model first needs to perform a forward pass to obtain the prediction loss. Currently the most efficient way to calculate gradients is the back-propagation algorithm. After obtaining the gradients we can then perform gradient descent.
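The update rule can be tried on a toy problem. The sketch below minimizes the hypothetical loss \(L(W) = (W - 3)^2\), whose gradient \(2(W - 3)\) is written by hand instead of being obtained by back-propagation; the starting point, learning rate and step count are arbitrary choices for illustration.

```python
def loss(w):
    """Toy loss L(W) = (W - 3)^2 with its minimum at W = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Hand-derived gradient of the toy loss with respect to W."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr, steps):
    """Repeat W_new = W_old - lr * grad(W_old) and record the loss after each step."""
    history = [loss(w)]
    for _ in range(steps):
        w = w - lr * grad(w)
        history.append(loss(w))
    return w, history


w_final, losses = gradient_descent(w=0.0, lr=0.1, steps=50)
print(w_final, losses[0], losses[-1])
```

With this learning rate the loss never increases from one step to the next, and the parameter approaches the minimizer \(3\).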

gradient explosion#
gradient vanishing#

When performing gradient descent, if the calculated gradients have large norm (large in magnitude), then the updated model parameters will also have large norm, which can result in values like Inf or NaN and make the model malfunction. This is called gradient explosion. At the other extreme, if the calculated gradients have norm close to zero, then model parameters will be updated extremely slowly. This is called gradient vanishing. These two cases happen frequently when optimizing deep learning models by gradient descent, especially when optimizing RNN models.

One can use gradient clipping to force the magnitude of gradients to fall within a certain bound. We use --max_norm in lmp.script.train_model to clip gradients. Gradient clipping can ease gradient explosion but not gradient vanishing. To address gradient vanishing, one has to design specific model structures so that gradients of parameters close to the input layer keep almost the same scale as gradients of parameters close to the output layer. For example, the internal states of LSTM1997 are one such mechanism. Other mechanisms like residual connections [2] have also been proposed.
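Clipping by norm can be sketched as follows. This is a plain-Python illustration of the general idea, with a hypothetical helper name clip_by_norm; it is not the code that lmp.script.train_model runs when --max_norm is given.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient values so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Small gradients are left untouched.
        return list(grads)
    scale = max_norm / total_norm
    # Large gradients keep their direction but are shrunk to norm max_norm.
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # approximately [0.6, 0.8]
```

Note that clipping only shrinks gradients that are too large; gradients that are already close to zero stay close to zero, which is why clipping does not help with gradient vanishing.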

hidden states#
initial hidden states#

When a model receives time-series data, some of its early computation results can serve as input to later computation. These computation results generated by the model on the fly are called hidden states. The hidden states of all time steps have identical structure. This means we can use for-loops to calculate hidden states. By the nature of for-loops, we must provide initial hidden states to make the loop work. This means initial hidden states are not generated on the fly but defined beforehand. One usually sets initial hidden states to zeros. One can also let initial hidden states be a part of the model parameters. For simplicity, we set initial hidden states to zeros in this project.
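The for-loop view of hidden states can be sketched with a scalar recurrence. The update \(h_t = \tanh(w_x x_t + w_h h_{t-1})\), the function name run_rnn and the weight values are assumptions made for illustration; they do not correspond to a specific model in this project.

```python
import math


def run_rnn(inputs, w_x, w_h, h0=0.0):
    """Compute hidden states step by step, starting from the initial state h0."""
    h = h0  # initial hidden state, set to zero by default
    hidden_states = []
    for x in inputs:
        # The previous hidden state is fed back in as part of the next input.
        h = math.tanh(w_x * x + w_h * h)
        hidden_states.append(h)
    return hidden_states


states = run_rnn([1.0, 0.0, -1.0], w_x=0.5, w_h=0.8)
print(states)
```

The first hidden state depends only on the first input and the zero initial state; every later hidden state also depends on the one before it.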


A model can have the same structure with different numbers of layers and units. The specific numbers of layers and units are called hyperparameters. Hyperparameters are decided before training. In general, all experiment-related parameters are hyperparameters. This includes the configuration for evaluation and inference.

label smoothing#

When performing classification, one usually optimizes a classifier to predict the correct label for the corresponding input with high confidence. High confidence means that the model is pushed to output probabilities of exactly \(0\) or \(1\) and nothing in between. Some argue that optimizing a model to have such high confidence is hard, and that optimizing for slightly lower confidence is comparatively easier. Optimizing a model toward slightly lower confidence is called label smoothing.

Precisely, suppose that we are given an input-output pair \((x, y)\) and \(C\) possible answers (classes). Suppose also that \(P(x) = \pa{P_1(x), \dots, P_C(x)}\) is the classification probability distribution given input \(x\). When optimizing a model with high confidence, we expect the model to output the following probability distribution:

\[\begin{split}\forall i \in \set{1, \dots, C}, P_i(x) = \begin{cases} 1 & \text{if } i = y \\ 0 & \text{otherwise} \end{cases}.\end{split}\]

When optimizing a model with label smoothing, one expects the model to output the following probability distribution:

\[\begin{split}\forall i \in \set{1, \dots, C}, P_i(x) = \begin{cases} 1 - \epsilon & \text{if } i = y \\ \dfrac{\epsilon}{C - 1} & \text{otherwise} \end{cases}.\end{split}\]

The value \(\epsilon\) is given as a hyperparameter and is typically a small positive number less than \(1\). Observe that the two formulas above are identical when \(\epsilon = 0\).
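The smoothed target distribution can be written out directly from the formula above. The helper name smoothed_target and the example values are chosen only for illustration.

```python
def smoothed_target(y, num_classes, eps):
    """Return the label-smoothed target distribution for the 1-based label y."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == y else off_value for i in range(1, num_classes + 1)]


target = smoothed_target(y=2, num_classes=5, eps=0.1)
print(target)  # 0.9 for class 2 and 0.025 for each of the other four classes
```

The entries still sum to \(1\), and setting eps to \(0\) recovers the high-confidence (one-hot) target.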

language model#
language models#

A language model is a model which calculates the probability that a given text comes from human language. For example, the text “How are you?” is used in daily conversation and thus a language model should output high probability or, equivalently, low perplexity. On the other hand, the text “You how are?” is meaningless and thus a language model should output low probability or, equivalently, high perplexity.

More precisely, a language model is an algorithm which takes text as input and outputs a probability. If a language model \(M\) has model parameters \(\theta\) and takes an input text \(x\), then we can interpret \(M(x; \theta)\) by the following rules:

  • If \(M(x; \theta) \approx 1\), then \(x\) very likely comes from human language.

  • If \(M(x; \theta) \approx 0\), then \(x\) is unlikely to come from human language.

The usual way to evaluate a language model is perplexity. In the 1990s or earlier, language models were used to evaluate generated text from speech recognition and machine translation. More recently (after 2019), language models with a huge number of parameters (like GPT [3] and BERT [4]) have been shown to be useful for many downstream NLP tasks, including Natural Language Understanding (NLU), Natural Language Generation (NLG), Question Answering (QA), cloze tests, etc.

In this project we provide scripts for training language models (lmp.script.train_model), evaluating language models (lmp.script.eval_dset_ppl) and generating continuation text using language models (lmp.script.gen_txt).

See also


All available scripts related to language models.


All available language models.

learning rate#

Gradients of the loss with respect to model parameters serve as the direction of optimization. But gradients with large magnitude can make optimization hard [1]. Thus one scales down gradients by multiplying them by a small number called the learning rate. Setting the learning rate to a small number typically makes the optimization process longer but more stable. Setting the learning rate to a large number typically makes the optimization process quicker but may cause it to diverge. One rule to keep in mind is that one should use a small learning rate when dealing with a huge number of model parameters.

log path#

All experiment log files are stored under the directory project_root/exp/log. If the experiment name is my_exp, then the experiment log path is project_root/exp/log/my_exp.
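The two directory conventions (experiment path and log path) can be expressed as small helpers. The function names below are hypothetical and only mirror the layout described in this glossary; project_root stands for wherever the project is checked out.

```python
from pathlib import PurePosixPath


def experiment_path(project_root, exp_name):
    """Build project_root/exp/<exp_name> as described in this glossary."""
    return PurePosixPath(project_root) / "exp" / exp_name


def log_path(project_root, exp_name):
    """Build project_root/exp/log/<exp_name> as described in this glossary."""
    return PurePosixPath(project_root) / "exp" / "log" / exp_name


print(experiment_path("project_root", "my_exp"))  # project_root/exp/my_exp
print(log_path("project_root", "my_exp"))         # project_root/exp/log/my_exp
```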

loss function#

A loss function is a function used to optimize a model and estimate its performance. The input of a loss function consists of model parameters and dataset samples. The output of a loss function is called the loss. In the deep learning field one usually uses two different functions for optimization and evaluation. For example, we use cross entropy loss to optimize language models and use perplexity to evaluate them. A loss function must have a lower bound so that the optimization process has a chance to approach the lower bound in a finite number of steps. Without a lower bound one cannot judge the performance of a model by the loss it produces.


We split a dataset into small chunks of samples when (CUDA) memory cannot fit the entire dataset. Each chunk of samples is called a mini-batch. In the deep learning field one usually performs optimization on mini-batches instead of the entire dataset.

model parameter#
model parameters#

A model is an algorithm which takes an input text and performs calculations with certain numbers. These numbers are called model parameters, and their values are adjusted by the optimization process.

See also


All available language models.

neural network#

PyTorch is a famous deep learning framework that provides lots of neural network utilities. In this project we use PyTorch to implement language models.


Many different Unicode code points can represent the same character. For example, a character can have a full-width form (e.g. １) and a half-width form (e.g. 1); Japanese katakana likewise come in half-width and full-width forms (e.g. ｱｲｳｴｵ and アイウエオ). Unicode normalization is a process which maps different representations of a character to the same code points, and NFKC is one way to achieve Unicode normalization. It is a standard tool for preprocessing text.


A process is called optimization or training if it takes a model \(M\) with parameters \(\theta\) and a loss function \(L\), and repeatedly adjusts \(\theta\) to bring \(L\) close to its lower bound within a finite number of steps. In the context of training neural networks, optimization usually means performing gradient descent.

PAD token#
PAD tokens#
padding token#
padding tokens#

The PAD token is a special token used for padding. If a mini-batch consists of token sequences with different lengths, then padding tokens are appended to the shorter sequences so that all token sequences have the same length. This is needed since we perform parallel computation when training a language model. In this project, language models are not trained to produce meaningful output when seeing PAD tokens.
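Padding a mini-batch to a common length can be sketched as follows. The helper name pad_or_truncate, the token ids and the choice of 0 as the PAD token id are assumptions for illustration; the sketch also cuts sequences that are longer than the target length.

```python
def pad_or_truncate(ids, max_len, pad_id):
    """Cut ids to at most max_len entries, then append pad_id up to max_len."""
    ids = list(ids[:max_len])
    return ids + [pad_id] * (max_len - len(ids))


batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
padded = [pad_or_truncate(seq, max_len=4, pad_id=0) for seq in batch]
print(padded)  # [[5, 6, 7, 0], [8, 0, 0, 0], [9, 10, 11, 12]]
```

After this step every sequence in the mini-batch has the same length, so the batch can be processed in parallel.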


Perplexity is a way to evaluate a language model. Suppose a text \(x\) consists of \(n\) tokens, \(x = (x_1, x_2, \dots, x_n)\). For each \(i \in \set{1, \dots, n}\), the probability of the next token being \(x_i\) when preceded by \(x_1, \dots, x_{i-1}\) is denoted as \(P(x_i|x_1, \dots, x_{i-1})\). The perplexity of \(x\), denoted as \(\operatorname{ppl}(x)\), is defined as follows:

\[\begin{split}\begin{align*} \operatorname{ppl}(x) &= \pa{P(x_1, x_2, \dots, x_n)}^{-1/n} \\ &= \pa{P(x_1) \times P(x_2|x_1) \times P(x_3|x_1, x_2) \times \dots \times P(x_n|x_1, x_2, \dots, x_{n-1})}^{-1/n} \\ &= \pa{\prod_{i=1}^n P(x_i|x_1, \dots, x_{i-1})}^{-1/n} \\ &= 2^{\displaystyle \pa{\log_2 \pa{\prod_{i=1}^n P(x_i|x_1, \dots, x_{i-1})}^{-1/n}}} \\ &= 2^{\displaystyle \pa{\dfrac{-1}{n} \log_2 \pa{\prod_{i=1}^n P(x_i|x_1, \dots, x_{i-1})}}} \\ &= 2^{\displaystyle \pa{\dfrac{-1}{n} \sum_{i=1}^n \log_2 P(x_i|x_1, \dots, x_{i-1})}}. \end{align*}\end{split}\]

If all probabilities \(P(x_i|x_1, \dots, x_{i-1})\) are high, then perplexity is low. Thus we expect a well-trained language model to have low perplexity.
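The last line of the derivation translates directly into code. The sketch below takes the conditional probabilities \(P(x_i|x_1, \dots, x_{i-1})\) as a plain list; the function name perplexity and the probability values are made up for illustration.

```python
import math


def perplexity(cond_probs):
    """Compute 2 ** (-(1/n) * sum(log2 p_i)) from per-token conditional probabilities."""
    n = len(cond_probs)
    avg_log2 = sum(math.log2(p) for p in cond_probs) / n
    return 2.0 ** (-avg_log2)


ppl_good = perplexity([0.5, 0.5, 0.5, 0.5])  # every token predicted with probability 0.5
ppl_bad = perplexity([0.1, 0.1, 0.1, 0.1])   # every token predicted with probability 0.1
print(ppl_good, ppl_bad)
```

When every token receives probability \(p\), the perplexity is \(1/p\), so higher probabilities give lower perplexity.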


Abbreviation for “previously trained”.

recurrent neural network#

A neural network in which some nodes in later layers connect back to nodes in earlier layers.

See also


All available language models.


In our project a sample in a dataset is a text (character sequence).


A data structure which is ordered by integer index. We use sequence and time-series interchangeably in this project.

Special token#
Special tokens#
special token#
special tokens#

A special token is an artificial token which is used to perform a specific computation. In this project, special tokens are added to each sample in a dataset when training language models.


Number of times a language model has been updated.


Number of overlapping tokens between two context windows. For example, suppose that we set the context window size to 4 and the stride to 2. Then the text hello world will be split into 5 character subsequences as follows:

lo w

A tensor is a generalization of a matrix. In our scenario it means stacked matrices. For example, if we have a list of \(5\) matrices, each with shape \((2, 3)\), then we can construct a tensor with shape \((5, 2, 3)\) by stacking all \(5\) matrices together. See the PyTorch tensor class torch.Tensor for more coding examples.

text normalization#

In this project, the term text normalization refers to a three-step process applied to a given text:

  1. Perform NFKC normalization on the given text. For example, _１__２___３_ is normalized into _1__2___3_, where _ represents whitespace.

  2. Replace consecutive whitespaces with a single whitespace. For example, _1__2___3_ will become _1_2_3_, where _ represents whitespace.

  3. Strip (remove) leading and trailing whitespaces. For example, _1_2_3_ will become 1_2_3, where _ represents whitespace.

One additional step may be applied depending on how you treat cases. If cases do not matter (which is called case-insensitive), then text normalization will transform all uppercase characters into lowercase characters. For example, ABC, AbC and aBc will all become abc. If cases do matter (which is called case-sensitive), then no additional step will be applied.
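The steps above can be sketched with the Python standard library. The function name normalize_text and the flag name is_uncased are hypothetical; the project's tokenizers may expose this differently.

```python
import re
import unicodedata


def normalize_text(text, is_uncased=False):
    """Apply NFKC, collapse whitespace, strip, and optionally lowercase."""
    # Step 1: NFKC normalization (e.g. full-width digits become ASCII digits).
    text = unicodedata.normalize("NFKC", text)
    # Step 2: replace consecutive whitespaces with a single whitespace.
    text = re.sub(r"\s+", " ", text)
    # Step 3: strip leading and trailing whitespaces.
    text = text.strip()
    # Optional step: lowercase everything in the case-insensitive setting.
    if is_uncased:
        text = text.lower()
    return text


# "\uff11", "\uff12" and "\uff13" are the full-width digits 1, 2 and 3.
result = normalize_text(" \uff11  \uff12   \uff13 ")
print(result)  # 1 2 3
```

Calling normalize_text("AbC", is_uncased=True) additionally lowercases the text and gives "abc".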


A data structure which is ordered by integer index, where indices are given the meaning of time. Common time-series data are sounds and natural languages. For example, the sentence “I like to eat apple.” can be treated as a character sequence where the first character (corresponding to integer index 0) is “I”, the second character (corresponding to integer index 1) is the whitespace “ ”, and the last character (corresponding to integer index 19) is “.”. We use sequence and time-series interchangeably in this project.


Computers treat everything as numbers. To perform text-related tasks, one usually chunks text into smaller pieces (called tokens) and converts each piece into a number so that computers can easily process them.

For example, when we tokenize the text 'abc 123' based on characters, we get ['a', 'b', 'c', ' ', '1', '2', '3']; when we tokenize the text 'abc 123' based on whitespace, we get ['abc', '123'].

The tool used to chunk text into tokens is called a tokenizer. How to tokenize is a research problem. Many tokenizers have been proposed (e.g. STANZA, proposed by Stanford). In this project our tokenizers provide utilities including tokenization, text normalization and formatting of language model training data.
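The character-based and whitespace-based examples in this glossary can be reproduced with a few one-line helpers. The function names are hypothetical and far simpler than the project's tokenizer classes; they only illustrate that detokenization undoes tokenization.

```python
def char_tokenize(text):
    """Split text into a list of single characters."""
    return list(text)


def char_detokenize(tokens):
    """Join characters back into text without any separator."""
    return "".join(tokens)


def whitespace_tokenize(text):
    """Split text on runs of whitespace."""
    return text.split()


def whitespace_detokenize(tokens):
    """Join tokens back into text with a single whitespace."""
    return " ".join(tokens)


print(char_tokenize("abc 123"))                # ['a', 'b', 'c', ' ', '1', '2', '3']
print(whitespace_tokenize("abc 123"))          # ['abc', '123']
print(char_detokenize(["a", "b", "c"]))        # abc
print(whitespace_detokenize(["a", "b", "c"]))  # a b c
```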

See also


All available tokenizers.

token id#
token ids#

Since computers only compute with numbers and tokens are text, we have to assign each token an integer (called a token id) and use token ids instead of tokens to perform computation. In our project, assigning each token a unique integer is called building a vocabulary.
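Building a vocabulary and converting tokens into token ids can be sketched as follows. The special token strings, the helper names build_vocab and encode, and the id assignment order are assumptions for illustration, not the project's actual scheme.

```python
def build_vocab(token_lists, specials=("<bos>", "<eos>", "<pad>", "<unk>")):
    """Assign a unique integer to each special token, then to each new token seen."""
    tk2id = {}
    for tk in specials:
        tk2id[tk] = len(tk2id)
    for tokens in token_lists:
        for tk in tokens:
            if tk not in tk2id:
                tk2id[tk] = len(tk2id)
    return tk2id


def encode(tokens, tk2id, unk="<unk>"):
    """Convert tokens into token ids; out-of-vocabulary tokens map to the UNK id."""
    return [tk2id.get(tk, tk2id[unk]) for tk in tokens]


vocab = build_vocab([["a", "b"], ["b", "c"]])
ids = encode(["a", "c", "z"], vocab)
print(ids)  # [4, 6, 3]
```

Here the token z never appeared while building the vocabulary, so it is converted to the id of the hypothetical UNK token.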


In this project, this term refers to truncating a token list to a specified length. This is the opposite operation of padding.

unknown token#
unknown tokens#

The UNK token is a special token which represents an unknown token. If a tokenizer encounters an out-of-vocabulary token when converting tokens into token ids, the tokenizer will treat such a token as the UNK token and convert it to the UNK token id. In this project, language models are trained to produce meaningful output when seeing UNK tokens. When encountering a UNK token, a language model can only base its next-token prediction on the tokens other than UNK.


A language model is paired with a tokenizer. How many tokens (characters, words, or other units) a language model can learn is constrained by model complexity and memory size. The set of tokens learnt by a language model is called its vocabulary. The number of tokens in a vocabulary is called the vocabulary size. Tokens not in the vocabulary of a language model are called out-of-vocabulary tokens.


[1] David Rumelhart, Geoffrey Hinton, and Ronald Williams. Learning representations by back-propagating errors. Nature, 1986. doi:10.1038/323533a0.


[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.


[3] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, and others. Improving language understanding by generative pre-training. 2018.


[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423.