Tokenizer training script

Use this script to train a tokenizer on a dataset.

This script must be run before training a language model. Once a tokenizer is trained, it can be shared across different scripts.

The following command trains a character tokenizer CharTknzr on the demo dataset DemoDset.

python -m lmp.script.train_tknzr character

The tokenizer training experiment is named my_tknzr_exp by default. The training result is saved at project_root/exp/my_tknzr_exp and can be reused by other scripts. To use a different name, set the --exp_name argument.

python -m lmp.script.train_tknzr character --exp_name other_name

To train a tokenizer on a different dataset, use the --dset_name argument.

python -m lmp.script.train_tknzr character --dset_name wiki-text-2

A tokenizer’s hyperparameters can be passed as arguments. For example, max_vocab and min_count are set with the --max_vocab and --min_count arguments.

python -m lmp.script.train_tknzr character --max_vocab 100 --min_count 2
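
To illustrate what these two hyperparameters typically control, here is a minimal sketch of a character-level vocabulary builder. It assumes that min_count is the minimum number of occurrences a character needs to enter the vocabulary and that max_vocab caps the vocabulary size. build_char_vocab is a hypothetical helper, not part of lmp; the actual behavior is defined by the tokenizer classes in lmp.tknzr.

```python
from collections import Counter
from typing import Dict, Iterable


def build_char_vocab(texts: Iterable[str], max_vocab: int, min_count: int) -> Dict[str, int]:
    """Hypothetical helper: map characters to ids under two assumed constraints."""
    counts = Counter(ch for text in texts for ch in text)
    vocab: Dict[str, int] = {}
    # most_common() yields characters in order of descending count.
    for ch, count in counts.most_common():
        if count < min_count:
            break  # every remaining character is at least as rare
        if len(vocab) >= max_vocab:
            break  # assumed cap on vocabulary size
        vocab[ch] = len(vocab)
    return vocab


vocab = build_char_vocab(['aaab', 'abc'], max_vocab=100, min_count=2)
print(vocab)  # {'a': 0, 'b': 1}
```

In this toy input, 'c' occurs only once, so it is dropped under min_count=2.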

Note that boolean hyperparameters are flags: they are set to True when the flag is given and to False when it is omitted.

# Setting `is_uncased=False`.
python -m lmp.script.train_tknzr character
# Setting `is_uncased=True`.
python -m lmp.script.train_tknzr character --is_uncased
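
This flag behavior matches the store_true action of Python's argparse module. The sketch below is a standalone illustration, not the actual lmp parser; only the --is_uncased option name is taken from the commands above.

```python
import argparse

# Standalone parser for illustration; lmp builds its own parser.
parser = argparse.ArgumentParser()
# store_true: the attribute defaults to False and becomes True when the flag appears.
parser.add_argument('--is_uncased', action='store_true')

print(parser.parse_args([]).is_uncased)                # False
print(parser.parse_args(['--is_uncased']).is_uncased)  # True
```

A store_true flag takes no value, so there is no way to write `--is_uncased False`; leaving the flag out is how False is selected.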

To train a different tokenizer, change the first argument to the specific tokenizer’s name.

python -m lmp.script.train_tknzr whitespace

Use the -h or --help option to list the available tokenizers.

python -m lmp.script.train_tknzr -h

Use the -h or --help option on a specific tokenizer to list its supported CLI arguments.

python -m lmp.script.train_tknzr whitespace -h
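
Selecting a tokenizer with the first argument and giving each tokenizer its own help text is the pattern that argparse subparsers provide. The following is a hedged sketch of that pattern. The body of parse_args, the default values, and the option types are assumptions made for illustration and may differ from the real lmp.script.train_tknzr.

```python
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Illustrative parser; not the actual lmp implementation."""
    parser = argparse.ArgumentParser('python -m lmp.script.train_tknzr')
    # The first positional argument picks the tokenizer.
    subparsers = parser.add_subparsers(dest='tknzr_name', required=True)
    for name in ['character', 'whitespace']:
        sub = subparsers.add_parser(name)
        # Defaults below are assumed for this sketch.
        sub.add_argument('--exp_name', type=str, default='my_tknzr_exp')
        sub.add_argument('--max_vocab', type=int, default=-1)
        sub.add_argument('--min_count', type=int, default=0)
        sub.add_argument('--is_uncased', action='store_true')
    return parser.parse_args(argv)


args = parse_args(['whitespace', '--max_vocab', '100', '--min_count', '2'])
print(args.tknzr_name, args.max_vocab, args.min_count, args.is_uncased)
```

With a parser built this way, passing ['whitespace', '-h'] prints only the options registered for that subcommand and then exits.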

See also

lmp.dset

All available datasets.

lmp.script.sample_dset

Get a glimpse of all available datasets.

lmp.script.tknz_txt

Use a pre-trained tokenizer to tokenize the given text.

lmp.tknzr

All available tokenizers.

lmp.script.train_tknzr.main(argv: List[str]) -> None

Script entry point.

Parameters

argv (list[str]) – List of CLI arguments.

Return type

None

lmp.script.train_tknzr.parse_args(argv: List[str]) -> Namespace

Parse CLI arguments.

Parameters

argv (list[str]) – List of CLI arguments.

See also

sys.argv

Python CLI arguments interface.

Returns

Parsed CLI arguments.

Return type

argparse.Namespace