lmp.util.tknzr

Tokenizer utilities.

lmp.util.tknzr.create(tknzr_name: str, **kwargs: Any) → BaseTknzr

Create tokenizer instance by tokenizer’s name.

The tokenizer’s arguments are collected in **kwargs and passed directly to the tokenizer’s constructor.

Parameters
  • tknzr_name (str) – Name of the tokenizer to create.

  • kwargs (Any, optional) – Tokenizer’s parameters.

Returns

Tokenizer instance.

Return type

lmp.tknzr.BaseTknzr

See also

lmp.tknzr

All available tokenizers.

Examples

>>> from lmp.tknzr import WsTknzr
>>> import lmp.util.tknzr
>>> tknzr = lmp.util.tknzr.create(
...   is_uncased=False,
...   max_vocab=10,
...   min_count=2,
...   tknzr_name=WsTknzr.tknzr_name,
... )
>>> assert isinstance(tknzr, WsTknzr)
>>> assert not tknzr.is_uncased
>>> assert tknzr.max_vocab == 10
>>> assert tknzr.min_count == 2
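
The name-based creation described above can be pictured as a dictionary dispatch. The following is a minimal sketch under stated assumptions, not the library's actual implementation: ToyBaseTknzr, ToyCharTknzr, ToyWsTknzr, TOY_TKNZR_OPTS, toy_create, and the name strings are hypothetical stand-ins invented for illustration.

```python
from typing import Any, Dict, Type


class ToyBaseTknzr:
  """Hypothetical stand-in for a tokenizer base class."""

  tknzr_name = 'base'

  def __init__(self, is_uncased: bool = False, max_vocab: int = -1, min_count: int = 0):
    self.is_uncased = is_uncased
    self.max_vocab = max_vocab
    self.min_count = min_count


class ToyCharTknzr(ToyBaseTknzr):
  """Hypothetical character tokenizer."""

  tknzr_name = 'character'


class ToyWsTknzr(ToyBaseTknzr):
  """Hypothetical whitespace tokenizer."""

  tknzr_name = 'whitespace'


# Map each tokenizer's name to its class (names are assumptions for this sketch).
TOY_TKNZR_OPTS: Dict[str, Type[ToyBaseTknzr]] = {
  ToyCharTknzr.tknzr_name: ToyCharTknzr,
  ToyWsTknzr.tknzr_name: ToyWsTknzr,
}


def toy_create(tknzr_name: str, **kwargs: Any) -> ToyBaseTknzr:
  """Look up the class by name and forward kwargs to its constructor."""
  if tknzr_name not in TOY_TKNZR_OPTS:
    raise ValueError(f'Unknown tokenizer name: {tknzr_name!r}')
  return TOY_TKNZR_OPTS[tknzr_name](**kwargs)
```

Because the keyword arguments are forwarded unchanged, an argument the chosen class does not accept surfaces as a TypeError from that class's constructor rather than from the factory.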
lmp.util.tknzr.load(exp_name: str) → BaseTknzr

Load pre-trained tokenizer from pickle file.

Load pre-trained tokenizer from path project_root/exp/exp_name.

Parameters

exp_name (str) – Pre-trained tokenizer experiment name.

Returns

Pre-trained tokenizer instance.

Return type

lmp.tknzr.BaseTknzr

See also

lmp.tknzr

All available tokenizers.

Examples

>>> from lmp.tknzr import WsTknzr
>>> import lmp.util.tknzr
>>> tknzr = lmp.util.tknzr.create(
...   is_uncased=True,
...   max_vocab=10,
...   min_count=2,
...   tknzr_name=WsTknzr.tknzr_name,
... )
>>> lmp.util.tknzr.save(exp_name='my_exp', tknzr=tknzr)
>>> load_tknzr = lmp.util.tknzr.load(exp_name='my_exp')
>>> assert isinstance(load_tknzr, WsTknzr)
>>> assert load_tknzr.id2tk == tknzr.id2tk
>>> assert load_tknzr.is_uncased == tknzr.is_uncased
>>> assert load_tknzr.max_vocab == tknzr.max_vocab
>>> assert load_tknzr.min_count == tknzr.min_count
>>> assert load_tknzr.tk2id == tknzr.tk2id
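
The save-then-load round trip above can be sketched with the standard pickle module. This is an illustrative sketch only, not the library's code: toy_save, toy_load, round_trip, the exp_root argument, and the file name tknzr.pkl are assumptions, and a SimpleNamespace stands in for a real tokenizer.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def toy_save(exp_root: str, exp_name: str, tknzr: SimpleNamespace) -> str:
  """Pickle tknzr under exp_root/exp_name and return the file path."""
  exp_dir = os.path.join(exp_root, exp_name)
  os.makedirs(exp_dir, exist_ok=True)
  # The file name 'tknzr.pkl' is an assumption made for this sketch.
  file_path = os.path.join(exp_dir, 'tknzr.pkl')
  with open(file_path, 'wb') as f:
    pickle.dump(tknzr, f)
  return file_path


def toy_load(exp_root: str, exp_name: str) -> SimpleNamespace:
  """Read back the object pickled by toy_save."""
  file_path = os.path.join(exp_root, exp_name, 'tknzr.pkl')
  with open(file_path, 'rb') as f:
    return pickle.load(f)


def round_trip() -> bool:
  """Save and reload a stand-in tokenizer inside a temporary directory."""
  # SimpleNamespace stands in for a real tokenizer object.
  original = SimpleNamespace(is_uncased=True, tk2id={'<pad>': 0, 'a': 1})
  with tempfile.TemporaryDirectory() as exp_root:
    toy_save(exp_root, 'my_exp', original)
    restored = toy_load(exp_root, 'my_exp')
  return restored == original
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.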
lmp.util.tknzr.save(exp_name: str, tknzr: BaseTknzr) → None

Save tokenizer as pickle file.

Danger

This function overwrites existing files. Make sure you know what you are doing before calling it.

Parameters
  • exp_name (str) – Tokenizer training experiment name.

  • tknzr (lmp.tknzr.BaseTknzr) – Tokenizer to be saved.

Return type

None

See also

load

Load pre-trained tokenizer instance by experiment name.

Examples

>>> from lmp.tknzr import CharTknzr
>>> import lmp.util.tknzr
>>> tknzr = CharTknzr(is_uncased=False, max_vocab=10, min_count=2)
>>> lmp.util.tknzr.save(exp_name='test', tknzr=tknzr)
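
Since saving overwrites existing files, a caller who wants to avoid clobbering an earlier experiment can guard the write. The sketch below is one possible approach using Python's exclusive-creation file mode. It is not part of lmp: save_no_overwrite and the file name tknzr.pkl are hypothetical.

```python
import os
import pickle
import tempfile


def save_no_overwrite(file_path: str, obj: object) -> None:
  """Pickle obj to file_path, refusing to replace an existing file."""
  # Mode 'xb' creates the file exclusively and raises FileExistsError
  # when file_path already exists.
  with open(file_path, 'xb') as f:
    pickle.dump(obj, f)


with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, 'tknzr.pkl')
  save_no_overwrite(path, {'a': 1})
  try:
    save_no_overwrite(path, {'b': 2})
    overwritten = True
  except FileExistsError:
    # The first file is left untouched.
    overwritten = False
```

Opening with exclusive creation avoids the gap between a separate existence check and the write, at the cost of requiring the caller to delete the old file explicitly when an overwrite is intended.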