Tokenizer training script
Use this script to train a tokenizer on a dataset.
This script must be run before training a language model. Once a tokenizer is trained, it can be shared across different scripts.
The following command trains a character tokenizer CharTknzr on the demo dataset DemoDset.
python -m lmp.script.train_tknzr character
By default, the tokenizer training experiment is named my_tknzr_exp.
The training results are saved at project_root/exp/my_tknzr_exp and can be reused by other scripts.
To use a different name, set the --exp_name argument.
python -m lmp.script.train_tknzr character --exp_name other_name
To train the tokenizer on a different dataset, use the --dset_name argument.
python -m lmp.script.train_tknzr character --dset_name wiki-text-2
A tokenizer’s hyperparameters can be passed as arguments.
For example, max_vocab and min_count can be set using the --max_vocab and --min_count arguments.
python -m lmp.script.train_tknzr character --max_vocab 100 --min_count 2
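To make the effect of these two hyperparameters concrete, here is a minimal sketch of how a character vocabulary could be limited by size and by frequency. It is not the lmp implementation: the helper name build_char_vocab is hypothetical, and the semantics (max_vocab=-1 meaning no limit, characters occurring fewer than min_count times being dropped) are assumptions made for illustration.

```python
from collections import Counter


def build_char_vocab(texts, max_vocab=-1, min_count=0):
    """Hypothetical helper: map characters to ids under two limits.

    Assumptions (not taken from the lmp source): `max_vocab=-1` means
    no size limit, and characters seen fewer than `min_count` times
    are dropped.
    """
    counts = Counter(ch for text in texts for ch in text)
    vocab = {}
    # most_common() yields characters from most to least frequent.
    for ch, freq in counts.most_common():
        if max_vocab != -1 and len(vocab) >= max_vocab:
            break
        if freq < min_count:
            # Every remaining character is at most this frequent.
            break
        vocab[ch] = len(vocab)
    return vocab


vocab = build_char_vocab(["aaab", "abc"], max_vocab=100, min_count=2)
print(vocab)  # {'a': 0, 'b': 1}
```

In this toy input, c occurs only once, so min_count=2 excludes it, while max_vocab=100 is large enough to have no effect.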
Note that boolean hyperparameters are set to True when their flag is given and to False otherwise.
# Setting `is_uncased=False`.
python -m lmp.script.train_tknzr character
# Setting `is_uncased=True`.
python -m lmp.script.train_tknzr character --is_uncased
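This flag behavior matches what Python's standard argparse module provides through action="store_true". The sketch below shows that mechanism in isolation; whether the script registers --is_uncased exactly this way is an assumption.

```python
import argparse

parser = argparse.ArgumentParser()
# A store_true flag defaults to False and becomes True when present.
parser.add_argument("--is_uncased", action="store_true")

without_flag = parser.parse_args([])
with_flag = parser.parse_args(["--is_uncased"])
print(without_flag.is_uncased)  # False
print(with_flag.is_uncased)  # True
```

A consequence of this design is that such a flag takes no value: writing --is_uncased False would be rejected as an unrecognized argument rather than setting the option to False.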
To train a different tokenizer, change the first argument to the specific tokenizer’s name.
python -m lmp.script.train_tknzr whitespace
Use the -h or --help option to get a list of available tokenizers.
python -m lmp.script.train_tknzr -h
Use the -h or --help option after a specific tokenizer's name to get a list of its supported CLI arguments.
python -m lmp.script.train_tknzr whitespace -h
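The pattern of a tokenizer name followed by tokenizer-specific arguments, each level with its own help message, can be reproduced with argparse subcommands. The following is a rough sketch under that assumption, not the script's actual parser; the program name train_tknzr_sketch is hypothetical, and only the names character, whitespace, --exp_name, and my_tknzr_exp come from the text above.

```python
import argparse

parser = argparse.ArgumentParser(prog="train_tknzr_sketch")
# Each tokenizer name becomes a subcommand with its own arguments.
subparsers = parser.add_subparsers(dest="tknzr_name", required=True)
for name in ("character", "whitespace"):
    sub = subparsers.add_parser(name)
    sub.add_argument("--exp_name", default="my_tknzr_exp")

args = parser.parse_args(["whitespace", "--exp_name", "other_name"])
print(args.tknzr_name, args.exp_name)  # whitespace other_name
```

With this layout, parser.parse_args(["-h"]) prints the list of subcommands, while parser.parse_args(["whitespace", "-h"]) prints only the options registered for that subcommand. Both calls exit after printing.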
See also
- lmp.dset
All available datasets.
- lmp.script.sample_dset
Get a glimpse of all available datasets.
- lmp.script.tknz_txt
Use a pre-trained tokenizer to tokenize the given text.
- lmp.tknzr
All available tokenizers.