Scripts#

Overview#

In this project, we provide a series of scripts for training a language model. The scripts are categorized into three groups: dataset-related, tokenizer-related and language-model-related scripts. The dataset-related group has only one script, lmp.script.sample_dset. The tokenizer-related group has two scripts, lmp.script.train_tknzr and lmp.script.tknz_txt. The remaining scripts belong to the language-model-related group. One should first execute the dataset-related script, then the tokenizer-related scripts, and finally the language-model-related scripts.

Sample from dataset#

One can use lmp.script.sample_dset to get a glimpse of dataset samples. For example, one can sample the demo dataset lmp.dset.DemoDset as follows:

python -m lmp.script.sample_dset demo

One can sample a different dataset by using a different dataset name:

python -m lmp.script.sample_dset wiki-text-2

Sampling is always done on the default version of a dataset. To specify a different version, one uses the --ver argument with the desired version:

python -m lmp.script.sample_dset demo --ver valid

There are many samples in a dataset. By default, the 0th sample is shown. To see a different sample, one uses --idx with a sample index other than 0:

python -m lmp.script.sample_dset demo --idx 1 --ver valid

See also

lmp.dset

All available datasets.

Train a tokenizer#

One can use the script lmp.script.train_tknzr to create an empty tokenizer and build the tokenizer’s vocabulary on top of a dataset. For example, one can train a whitespace tokenizer lmp.tknzr.WsTknzr on the training set of the dataset lmp.dset.WikiText2Dset:

python -m lmp.script.train_tknzr whitespace \
  --dset_name wiki-text-2 \
  --exp_name my_tknzr_exp \
  --max_vocab 10000 \
  --min_count 0 \
  --ver train

In the above example, we use whitespace as the first argument to specify that we want to train a whitespace tokenizer. An empty whitespace tokenizer is created with the arguments is_uncased=False, max_vocab=10000 and min_count=0. We use --dset_name wiki-text-2 and --ver train to specify that we want to build the tokenizer’s vocabulary on top of the training set of the wiki-text-2 dataset. We use the argument --exp_name to name our tokenizer training experiment my_tknzr_exp. The tokenizer training results will be saved under the experiment path project_root/exp/my_tknzr_exp.

One can decide how many tokens to include in a tokenizer’s vocabulary. The parameter --max_vocab is the maximum number of tokens to be included in a tokenizer’s vocabulary. When setting --max_vocab -1, one can have an unlimited number of tokens (limited only by memory size) in a tokenizer’s vocabulary. The following example results in a vocabulary size of around 33000:

python -m lmp.script.train_tknzr whitespace \
  --dset_name wiki-text-2 \
  --exp_name my_tknzr_exp \
  --max_vocab -1 \
  --min_count 0 \
  --ver train

Sometimes there are tokens which occur only once or a few times. These tokens are usually named entities or, even worse, typos. One can filter out such tokens by setting the minimum occurrence count required for a token to be included in a tokenizer’s vocabulary. The parameter --min_count serves this purpose. When setting --min_count 0, no filtering will be performed. The following example results in a vocabulary size of around 14000:

python -m lmp.script.train_tknzr whitespace \
  --dset_name wiki-text-2 \
  --exp_name my_tknzr_exp \
  --max_vocab -1 \
  --min_count 10 \
  --ver train

The same character sequence with different cases (for example, Apple and apple) will be treated as different tokens. One can make a tokenizer case-insensitive using the argument --is_uncased. The following example results in a vocabulary size of around 13000:

python -m lmp.script.train_tknzr whitespace \
  --dset_name wiki-text-2 \
  --exp_name my_tknzr_exp \
  --is_uncased \
  --max_vocab -1 \
  --min_count 10 \
  --ver train
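
Conceptually, the three arguments --is_uncased, --min_count and --max_vocab interact as in the following minimal Python sketch. This is only an illustration of the idea, not the actual implementation in lmp.tknzr; the helper name build_vocab and the handling of special tokens are assumptions.

from collections import Counter

def build_vocab(texts, is_uncased=False, min_count=0, max_vocab=-1):
  # Count token occurrences over the whole corpus.
  counter = Counter()
  for text in texts:
    if is_uncased:
      # Case-insensitive: Apple and apple become the same token.
      text = text.lower()
    counter.update(text.split())
  # Keep only tokens occurring at least min_count times (0 means no filtering).
  tokens = [tok for tok, cnt in counter.most_common() if cnt >= min_count]
  # Keep at most max_vocab tokens (-1 means unlimited).
  if max_vocab != -1:
    tokens = tokens[:max_vocab]
  return tokens

print(build_vocab(['hello world', 'Hello again'], is_uncased=True))
# ['hello', 'world', 'again']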

See also

lmp.tknzr

All available tokenizers.

Tokenize text#

After training a tokenizer, one can use the pre-trained tokenizer to tokenize text. For example, following the examples in the previous section, we can tokenize the text hello world into ['hello', 'world']:

python -m lmp.script.tknz_txt --exp_name my_tknzr_exp --txt "hello world"

In the above example, we use --exp_name my_tknzr_exp to load the pre-trained tokenizer. We provide the argument --txt "hello world" to tokenize the character sequence "hello world".
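
In other words, a whitespace tokenizer splits text on whitespace and maps each token to an id in its vocabulary, falling back to an unknown-token id for out-of-vocabulary tokens. The snippet below is a rough sketch of this idea only; the vocabulary contents and the unknown-token name are made up and do not reflect the lmp.tknzr API.

# Hypothetical vocabulary; the real one is built by lmp.script.train_tknzr.
vocab = {'<unk>': 0, 'hello': 1, 'world': 2}

def tokenize(text):
  return text.split()

def encode(text):
  return [vocab.get(tok, vocab['<unk>']) for tok in tokenize(text)]

print(tokenize('hello world'))  # ['hello', 'world']
print(encode('hello world'))    # [1, 2]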

Train a language model#

One can use a language model to generate continual text conditioned on a given text. Before that, one has to first optimize a language model’s loss function on a dataset. To perform the optimization, one can use the language model training script lmp.script.train_model. For example, we can train a LSTM (2000 version) language model LSTM2000 on the training set of the wiki-text-2 dataset WikiText2Dset:

python -m lmp.script.train_model LSTM-2000 \
  --batch_size 64 \
  --beta1 0.9 \
  --beta2 0.999 \
  --ckpt_step 1000 \
  --d_blk 1 \
  --d_emb 1000 \
  --dset_name wiki-text-2 \
  --eps 1e-8 \
  --exp_name my_model_exp \
  --init_fb 1.0 \
  --init_ib -1.0 \
  --init_lower -0.1 \
  --init_ob -1.0 \
  --init_upper 0.1 \
  --label_smoothing 0.0 \
  --log_step 200 \
  --lr 1e-4 \
  --max_norm 1 \
  --max_seq_len 32 \
  --n_blk 1000 \
  --n_lyr 1 \
  --p_emb 0.1 \
  --p_hid 0.1 \
  --stride 32 \
  --tknzr_exp_name my_tknzr_exp \
  --ver train \
  --total_step 10000 \
  --warmup_step 5000 \
  --weight_decay 1e-2

Language model arguments#

The first argument in the above example is the name of a language model. Different language models have different structures. One can see a specific model’s hyperparameters using the -h argument. For example, the hyperparameters of LSTM2000 include --d_emb, --d_blk, --init_fb, --init_ib, --init_lower, --init_ob, --init_upper, --label_smoothing, --n_blk, --n_lyr, --p_emb and --p_hid. One can see the hyperparameters of LSTM2000 using the following command:

python -m lmp.script.train_model LSTM-2000 -h

Text processing arguments#

Every language model is paired with a tokenizer. The paired tokenizer shares its vocabulary with the language model. In the example above, we set --tknzr_exp_name my_tknzr_exp to use the pre-trained tokenizer with experiment name my_tknzr_exp. One usually uses a tokenizer trained on the same dataset and the same version. Thus the above example uses --dset_name wiki-text-2 and --ver train, just as in the tokenizer training experiment.

When preprocessing a dataset, each text sample in the training set is tokenized and encoded as a token id sequence. The token id sequence is then split into multiple subsequences called context windows. Context windows are then used to train the language model; collectively, they are called the “language model formatting dataset”. Each context window has the same length, which is defined by --max_seq_len. Context windows shorter than --max_seq_len are padded to have length equal to --max_seq_len. Context windows may overlap. The overlap is controlled by --stride: each context window starts --stride tokens (or token ids) after the previous one. If one wishes to have no overlap, one can set --max_seq_len and --stride to the same number.
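
The interaction between --max_seq_len and --stride can be illustrated with the following sketch. It is a simplified illustration only, not the actual preprocessing code in this project; in particular, the padding id 0 is an assumption.

def split_into_context_windows(token_ids, max_seq_len, stride, pad_id=0):
  # Each window starts `stride` ids after the previous one and holds at most
  # `max_seq_len` ids; shorter windows are padded up to `max_seq_len`.
  windows = []
  for start in range(0, len(token_ids), stride):
    window = token_ids[start:start + max_seq_len]
    window = window + [pad_id] * (max_seq_len - len(window))
    windows.append(window)
  return windows

# With max_seq_len == stride (as in the training example above), windows do not overlap.
print(split_into_context_windows(list(range(1, 11)), max_seq_len=4, stride=4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]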

Optimization algorithm arguments#

The optimization algorithm is torch.optim.AdamW. Due to memory limits and computation cost, we chunk the dataset into mini-batches to perform optimization. The batch size is set by the argument --batch_size. In the example above, we fetch 64 samples from the wiki-text-2 dataset for each optimization step. We train the language model by epochs, i.e., we sample the dataset without repetition until every sample has been used for training once.

An optimization step is performed on a batch of context windows. The total number of optimization steps is set by --total_step. Padding tokens do not contribute to the loss. One can adjust the context window size by changing the value of --max_seq_len. The arguments directly passed to torch.optim.AdamW are --beta1, --beta2, --eps, --lr and --weight_decay. The betas parameter of torch.optim.AdamW is split into --beta1 and --beta2. The eps of torch.optim.AdamW is given by --eps. The weight_decay parameter is given by --weight_decay. The learning rate is given by --lr. The learning rate is scheduled to warm up linearly to the value given by --lr and then decay linearly to 0 after reaching the peak value. The number of warmup steps is set by --warmup_step. To avoid gradient explosion, we use the argument --max_norm to put an upper bound on the gradient norm of all parameters. Gradients with norm larger than --max_norm will be clipped to --max_norm.
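
Putting these arguments together, the optimizer, learning rate schedule and gradient clipping roughly correspond to the following PyTorch sketch. It is a simplified illustration under stated assumptions (a stand-in model and a placeholder loss), not the exact training loop of lmp.script.train_model.

import torch

# Stand-in for the actual language model.
model = torch.nn.Linear(10, 10)

# --beta1, --beta2, --eps, --lr and --weight_decay are passed straight to AdamW.
optimizer = torch.optim.AdamW(
  model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)

# Linear warmup to the peak learning rate, then linear decay to 0.
total_step, warmup_step = 10000, 5000

def lr_lambda(step):
  if step < warmup_step:
    return step / warmup_step
  return max(0.0, (total_step - step) / (total_step - warmup_step))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_step):
  loss = model(torch.randn(64, 10)).sum()  # placeholder loss on a batch of 64
  loss.backward()
  # Clip the gradient norm of all parameters to --max_norm.
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  optimizer.step()
  scheduler.step()
  optimizer.zero_grad()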

Logging arguments#

Every 1000 steps, we save the model parameters under the experiment path project_root/exp/my_model_exp. This is done by setting --ckpt_step 1000 and --exp_name my_model_exp. Similarly, by setting --log_step 200, we log model performance every 200 steps. Model performance is logged on the CLI and on tensorboard. One can launch tensorboard and open a browser at http://localhost:6006 to see the performance logs. Use the following command to launch tensorboard:

pipenv run tensorboard

See also

lmp.model

All available language models.

Evaluate dataset perplexity#

To perform perplexity evaluation on a dataset, one uses the evaluation script lmp.script.eval_dset_ppl:

python -m lmp.script.eval_dset_ppl wiki-text-2 \
  --batch_size 32 \
  --first_ckpt 0 \
  --exp_name my_model_exp \
  --ver valid

One uses a dataset’s name as the first argument. The specific version of the dataset to evaluate can be set with the --ver argument. One needs to specify which experiment to evaluate using the --exp_name argument. Since evaluation does not construct a computation graph, one can use a larger --batch_size compared to training. Other settings, such as the context window size and maximum sequence length, follow the training settings. Since pre-trained language models are saved as model parameter checkpoints, one also needs to specify which checkpoints to evaluate. Because there can be many checkpoints to evaluate, we provide two arguments, --first_ckpt and --last_ckpt, for one to specify the starting (first) and ending (last) checkpoint numbers to be evaluated. To evaluate every checkpoint, one can simply set --first_ckpt 0 and --last_ckpt -1:

python -m lmp.script.eval_dset_ppl wiki-text-2 \
  --batch_size 32 \
  --first_ckpt 0 \
  --exp_name my_model_exp \
  --last_ckpt -1 \
  --ver valid

We use tensorboard to log the evaluation results. One can launch tensorboard and open a browser at http://localhost:6006 to see the evaluation results. Use the following command to launch tensorboard:

pipenv run tensorboard

Evaluate text perplexity#

To evaluate language model perplexity on a given text instead of a dataset, one uses the script lmp.script.eval_txt_ppl:

python -m lmp.script.eval_txt_ppl --ckpt -1 \
  --exp_name my_model_exp \
  --txt "hello world"

We use --ckpt to specify the checkpoint to evaluate, and --exp_name to specify the evaluation experiment name. In the above example, we evaluate the character sequence "hello world" by setting --txt "hello world".
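
Perplexity is the exponential of the average per-token negative log-likelihood that the model assigns to the text. The sketch below shows the computation with made-up per-token probabilities; it is a definition-level illustration, not the lmp implementation.

import math

# Hypothetical probabilities a language model assigns to each token of a text.
token_probs = [0.2, 0.05, 0.1]

# Perplexity: exp of the mean negative log-likelihood over the tokens.
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(perplexity)  # roughly 10.0 (the inverse geometric mean of the probabilities)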

Generate continual text#

One can use a pre-trained language model to generate continual text conditioned on a given text. One uses the script lmp.script.gen_txt and selects a decoding strategy to generate continual text:

python -m lmp.script.gen_txt top-1 \
  --ckpt -1 \
  --exp_name my_model_exp \
  --max_seq_len 32 \
  --txt "A very good"

The first argument is the name of an inference method. Different inference methods have different arguments. For example, the top-P inference method TopPInfer requires the probability threshold --p to perform inference. As when evaluating a language model, one specifies the experiment name and checkpoint to perform generation. When setting --ckpt -1, the last checkpoint is used to generate continual text. The conditioning text for generation is given by --txt. To avoid never-ending generation, one specifies the maximum length to generate by setting --max_seq_len.
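
Top-1 decoding greedily appends, at each step, the token id with the highest predicted probability until --max_seq_len ids have been produced. The loop below sketches this idea with a toy stand-in model; it is an illustration only, not the actual lmp.infer implementation, and the end-of-sequence handling is an assumption.

import torch

def greedy_generate(model, token_ids, max_seq_len, eos_id=None):
  # Repeatedly pick the most probable next token id (top-1 decoding).
  ids = list(token_ids)
  while len(ids) < max_seq_len:
    logits = model(torch.tensor([ids]))  # shape: (1, seq_len, vocab_size)
    next_id = int(logits[0, -1].argmax())
    ids.append(next_id)
    if eos_id is not None and next_id == eos_id:
      break
  return ids

# Toy stand-in model: an embedding followed by a linear layer over a tiny vocabulary.
vocab_size = 8
toy_model = torch.nn.Sequential(
  torch.nn.Embedding(vocab_size, 16), torch.nn.Linear(16, vocab_size))

print(greedy_generate(toy_model, [3, 5], max_seq_len=6))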

See also

lmp.infer

All available inference methods.