lmp.util.tknzr
Tokenizer utilities.
- lmp.util.tknzr.create(tknzr_name: str, **kwargs: Any) -> BaseTknzr
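Create a tokenizer instance by tokenizer name.
Tokenizer arguments are collected in **kwargs and passed directly to the tokenizer's constructor.
- Parameters
tknzr_name (str) – Name of the tokenizer to create.
kwargs (Any) – Tokenizer arguments, passed directly to the tokenizer's constructor.
- Returns
Tokenizer instance.
- Return type
BaseTknzr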
See also
- lmp.tknzr
All available tokenizers.
Examples
>>> from lmp.tknzr import WsTknzr
>>> import lmp.util.tknzr
>>> tknzr = lmp.util.tknzr.create(
...   is_uncased=False,
...   max_vocab=10,
...   min_count=2,
...   tknzr_name=WsTknzr.tknzr_name,
... )
>>> assert isinstance(tknzr, WsTknzr)
>>> assert not tknzr.is_uncased
>>> assert tknzr.max_vocab == 10
>>> assert tknzr.min_count == 2
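A name-based factory of this kind can be sketched as a dictionary lookup followed by keyword forwarding. The sketch below is an illustration only, not the library's implementation: the classes, the name strings, and the `TKNZR_OPTS` registry are hypothetical stand-ins defined here so the example runs without `lmp` installed.

```python
from typing import Any, Dict, Type


class BaseTknzr:
  """Hypothetical stand-in for a tokenizer base class."""

  tknzr_name = 'base'

  def __init__(self, is_uncased: bool, max_vocab: int, min_count: int):
    self.is_uncased = is_uncased
    self.max_vocab = max_vocab
    self.min_count = min_count


class CharTknzr(BaseTknzr):
  tknzr_name = 'character'  # Assumed name string, for illustration only.


class WsTknzr(BaseTknzr):
  tknzr_name = 'whitespace'  # Assumed name string, for illustration only.


# Hypothetical registry mapping each tokenizer name to its class.
TKNZR_OPTS: Dict[str, Type[BaseTknzr]] = {
  cls.tknzr_name: cls for cls in (CharTknzr, WsTknzr)
}


def create(tknzr_name: str, **kwargs: Any) -> BaseTknzr:
  """Look up the class by name and forward every keyword argument to it."""
  if tknzr_name not in TKNZR_OPTS:
    raise ValueError(f'Unknown tokenizer name: {tknzr_name!r}')
  return TKNZR_OPTS[tknzr_name](**kwargs)


tknzr = create(
  tknzr_name=WsTknzr.tknzr_name,
  is_uncased=False,
  max_vocab=10,
  min_count=2,
)
```

Because the keyword arguments are forwarded untouched, a misspelled or missing argument surfaces as a `TypeError` from the selected class's constructor rather than from the factory itself.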
- lmp.util.tknzr.load(exp_name: str) -> BaseTknzr
Load pre-trained tokenizer from pickle file.
The tokenizer is loaded from the path project_root/exp/exp_name.
- Parameters
exp_name (str) – Pre-trained tokenizer experiment name.
- Returns
Pre-trained tokenizer instance.
- Return type
BaseTknzr
See also
- lmp.tknzr
All available tokenizers.
Examples
>>> from lmp.tknzr import WsTknzr
>>> import lmp.util.tknzr
>>> tknzr = lmp.util.tknzr.create(
...   is_uncased=True,
...   max_vocab=10,
...   min_count=2,
...   tknzr_name=WsTknzr.tknzr_name,
... )
>>> tknzr.save(exp_name='my_exp')
>>> load_tknzr = lmp.util.tknzr.load(exp_name='my_exp')
>>> assert isinstance(load_tknzr, WsTknzr)
>>> assert load_tknzr.id2tk == tknzr.id2tk
>>> assert load_tknzr.is_uncased == tknzr.is_uncased
>>> assert load_tknzr.max_vocab == tknzr.max_vocab
>>> assert load_tknzr.min_count == tknzr.min_count
>>> assert load_tknzr.tk2id == tknzr.tk2id
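The save-then-load round trip through an experiment directory can be illustrated with the standard `pickle` module. This is a minimal sketch under stated assumptions: the file name `tknzr.pkl`, the explicit `exp_root` argument, and the plain dictionary standing in for a tokenizer are all hypothetical, chosen so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict

FILE_NAME = 'tknzr.pkl'  # Assumed file name; the library's may differ.


def save(exp_root: Path, exp_name: str, state: Dict[str, Any]) -> None:
  """Pickle `state` to exp_root/exp_name/FILE_NAME, creating directories."""
  exp_dir = exp_root / exp_name
  exp_dir.mkdir(parents=True, exist_ok=True)
  with open(exp_dir / FILE_NAME, 'wb') as f:
    pickle.dump(state, f)


def load(exp_root: Path, exp_name: str) -> Dict[str, Any]:
  """Unpickle the state stored under exp_root/exp_name."""
  with open(exp_root / exp_name / FILE_NAME, 'rb') as f:
    return pickle.load(f)


# A plain dictionary stands in for a trained tokenizer's attributes.
state = {
  'is_uncased': True,
  'max_vocab': 10,
  'min_count': 2,
  'tk2id': {'<pad>': 0, '<unk>': 1},
  'id2tk': {0: '<pad>', 1: '<unk>'},
}

with tempfile.TemporaryDirectory() as tmp:
  exp_root = Path(tmp) / 'exp'
  save(exp_root, 'my_exp', state)
  loaded = load(exp_root, 'my_exp')
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from experiment directories you trust.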
- lmp.util.tknzr.save(exp_name: str, tknzr: BaseTknzr) -> None
Save tokenizer as pickle file.
Danger
This method overwrites existing files. Make sure you know what you are doing before calling it.
- Parameters
exp_name (str) – Tokenizer training experiment name.
tknzr (lmp.tknzr.BaseTknzr) – Tokenizer to be saved.
- Return type
None
See also
load
Load pre-trained tokenizer instance by experiment name.
Examples
>>> from lmp.tknzr import CharTknzr
>>> import lmp.util.tknzr
>>> tknzr = CharTknzr(is_uncased=False, max_vocab=10, min_count=2)
>>> lmp.util.tknzr.save(exp_name='test', tknzr=tknzr)
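Since saving overwrites existing files, callers who want protection can wrap the write in a guard of their own. The sketch below shows one possible approach using Python's exclusive-creation file mode. The helper `save_no_clobber`, its `overwrite` flag, and the file path are hypothetical and not part of `lmp`.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_no_clobber(file_path: Path, obj: Any, overwrite: bool = False) -> None:
  """Pickle `obj` to `file_path`, refusing to replace an existing file by default."""
  file_path.parent.mkdir(parents=True, exist_ok=True)
  # Mode 'xb' raises FileExistsError if the file exists; 'wb' truncates it.
  mode = 'wb' if overwrite else 'xb'
  with open(file_path, mode) as f:
    pickle.dump(obj, f)


with tempfile.TemporaryDirectory() as tmp:
  target = Path(tmp) / 'exp' / 'test' / 'tknzr.pkl'

  # First write succeeds because nothing exists at the target yet.
  save_no_clobber(target, {'max_vocab': 10})

  # Second write is refused unless the caller opts in explicitly.
  try:
    save_no_clobber(target, {'max_vocab': 20})
    refused = False
  except FileExistsError:
    refused = True

  save_no_clobber(target, {'max_vocab': 20}, overwrite=True)
  with open(target, 'rb') as f:
    final_state = pickle.load(f)
```

Exclusive creation performs the existence check and the file creation in one step, which avoids the race between a separate `exists()` test and the subsequent write.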