Tokenizer training script

Use this script to train a tokenizer on a dataset.

This script must be run before training a language model. Once a tokenizer is trained, it can be shared across different scripts.

The following command trains a character tokenizer, CharTknzr, on the demo dataset DemoDset.

python -m lmp.script.train_tknzr character

The tokenizer training experiment above is named my_tknzr_exp. The training result is saved at project_root/exp/my_tknzr_exp and can be reused by other scripts. To use a different name, set the --exp_name argument.

python -m lmp.script.train_tknzr character --exp_name other_name

To train a tokenizer on a different dataset, use the --dset_name argument.

python -m lmp.script.train_tknzr character --dset_name wiki-text-2

A tokenizer's hyperparameters can be passed as arguments. For example, max_vocab and min_count are set with the --max_vocab and --min_count arguments.

python -m lmp.script.train_tknzr character --max_vocab 100 --min_count 2
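
As a rough illustration of what these two hyperparameters typically control, the sketch below builds a character vocabulary with a frequency threshold and a size cap. It is not the lmp implementation: the function name build_vocab is hypothetical, and the exact semantics (for example, whether special tokens count toward max_vocab) are assumptions made for this example.

```python
from collections import Counter


def build_vocab(texts, max_vocab, min_count):
    """Hypothetical helper: map characters to ids under two constraints.

    Assumption: a character is kept only if it occurs at least `min_count`
    times, and at most `max_vocab` characters are kept, most frequent first.
    """
    counts = Counter(ch for text in texts for ch in text)
    vocab = {}
    # most_common() yields characters in order of descending frequency.
    for ch, freq in counts.most_common():
        if len(vocab) >= max_vocab:
            break  # size cap reached
        if freq < min_count:
            break  # every remaining character is at least as rare
        vocab[ch] = len(vocab)
    return vocab


# 'a' occurs 4 times, 'b' twice, 'c' once, so 'c' falls below min_count=2.
vocab = build_vocab(["aaab", "abc"], max_vocab=100, min_count=2)
print(vocab)
```

In this sketch, raising min_count drops rare characters, while lowering max_vocab keeps only the most frequent ones.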

Note that boolean hyperparameters are set to False when the flag is omitted and to True when the flag is given.

# Setting `is_uncased=False`.
python -m lmp.script.train_tknzr character
# Setting `is_uncased=True`.
python -m lmp.script.train_tknzr character --is_uncased
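
This flag behaviour matches the store_true action in Python's argparse module. The following minimal sketch assumes the script declares its boolean hyperparameters that way; the parser shown here is illustrative and is not taken from the lmp source.

```python
import argparse

# Illustrative parser only; the real script defines its own arguments.
parser = argparse.ArgumentParser()
# `store_true` makes the option a flag: absent -> False, present -> True.
parser.add_argument("--is_uncased", action="store_true")

print(parser.parse_args([]).is_uncased)                # flag omitted
print(parser.parse_args(["--is_uncased"]).is_uncased)  # flag given
```

Under this pattern the flag takes no value, so the option is written as --is_uncased alone rather than followed by True or False.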

To train a different tokenizer, change the first argument to the specific tokenizer’s name.

python -m lmp.script.train_tknzr whitespace

Use the -h or --help option to list the available tokenizers.

python -m lmp.script.train_tknzr -h

Use the -h or --help option after a specific tokenizer's name to list the CLI arguments that tokenizer supports.

python -m lmp.script.train_tknzr whitespace -h
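
The two levels of help shown above are what argparse sub-commands provide: help on the top-level parser lists the sub-command names, and help on a sub-command lists that sub-command's own options. The sketch below shows one way such a CLI could be laid out. The program name train_tknzr_sketch and the attribute name tknzr_name are hypothetical, and the actual script may be organised differently.

```python
import argparse

# Illustrative layout only: one sub-command per tokenizer name.
parser = argparse.ArgumentParser(prog="train_tknzr_sketch")
subparsers = parser.add_subparsers(dest="tknzr_name", required=True)

for name in ("character", "whitespace"):
    sub = subparsers.add_parser(name)
    # Each sub-command declares its own options.
    sub.add_argument("--exp_name", default="my_tknzr_exp")

# The first positional argument selects the sub-command.
args = parser.parse_args(["whitespace", "--exp_name", "other_name"])
print(args.tknzr_name, args.exp_name)
```

With this layout, passing -h alone prints the list of sub-commands, while passing whitespace -h prints only the options registered on the whitespace sub-command.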