lmp.script.train_tknzr
Use this script to train a tokenizer on a dataset.
This script must be run before training a language model. Once a tokenizer is trained, it can be reused across different scripts.
The following command trains a character tokenizer CharTknzr on the demo dataset DemoDset.
python -m lmp.script.train_tknzr character
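Conceptually, a character tokenizer splits text into individual characters, each of which becomes one token. A minimal illustrative sketch of that idea (not lmp's actual CharTknzr implementation):

```python
# Illustrative sketch of character-level tokenization.
# This is NOT lmp's actual CharTknzr implementation.
def char_tokenize(text: str) -> list:
    """Split text into a list of single-character tokens."""
    return list(text)

print(char_tokenize("hello"))  # ['h', 'e', 'l', 'l', 'o']
```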
The tokenizer training experiment is named my_tknzr_exp.
The training result is saved under project_root/exp/my_tknzr_exp and can be reused by other scripts.
To use a different name, set the --exp_name argument.
python -m lmp.script.train_tknzr character --exp_name other_name
To train the tokenizer on a different dataset, use the --dset_name argument.
python -m lmp.script.train_tknzr character --dset_name wiki-text-2
A tokenizer's hyperparameters can be passed as arguments.
For example, one can set max_vocab and min_count with the --max_vocab and --min_count arguments.
python -m lmp.script.train_tknzr character --max_vocab 100 --min_count 2
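Intuitively, min_count discards tokens that occur too rarely, and max_vocab caps the vocabulary at the most frequent tokens. A hedged sketch of this typical behaviour (the exact logic inside lmp may differ):

```python
from collections import Counter

# Illustrative sketch of how `min_count` and `max_vocab` usually shape a
# vocabulary; lmp's actual implementation may differ in detail.
def build_vocab(tokens, max_vocab=100, min_count=2):
    counts = Counter(tokens)
    # Drop tokens seen fewer than `min_count` times.
    frequent = [tok for tok, c in counts.most_common() if c >= min_count]
    # Keep at most `max_vocab` of the most frequent tokens.
    return frequent[:max_vocab]

vocab = build_vocab(list("aaabbc"), max_vocab=100, min_count=2)
print(vocab)  # ['a', 'b']  ('c' appears only once and is dropped)
```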
Note that boolean hyperparameters are set to False if not given, and to True if given.
# Setting `is_uncased=False`.
python -m lmp.script.train_tknzr character
# Setting `is_uncased=True`.
python -m lmp.script.train_tknzr character --is_uncased
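This present/absent flag behaviour is how boolean options are typically wired up with argparse's store_true action; a small sketch (lmp's actual parser may differ in detail):

```python
import argparse

# Sketch of a boolean CLI flag like `--is_uncased`; the flag defaults to
# False and flips to True only when supplied on the command line.
parser = argparse.ArgumentParser()
parser.add_argument('--is_uncased', action='store_true')

print(parser.parse_args([]).is_uncased)                 # False
print(parser.parse_args(['--is_uncased']).is_uncased)   # True
```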
To train a different tokenizer, change the first argument to the desired tokenizer's name.
python -m lmp.script.train_tknzr whitespace
Use the -h or --help option to get a list of available tokenizers.
python -m lmp.script.train_tknzr -h
Use the -h or --help option on a specific tokenizer to get a list of its supported CLI arguments.
python -m lmp.script.train_tknzr whitespace -h
See also
- lmp.dset: All available datasets.
- lmp.script.sample_dset: Get a glimpse of all available datasets.
- lmp.script.tknz_txt: Use a pre-trained tokenizer to tokenize given text.
- lmp.tknzr: All available tokenizers.