lmp.tknzr._ws#

Whitespace tokenizer class.

lmp.tknzr._ws.SPLIT_PTTN#

Pattern matching special tokens and whitespace.

Type

Final[re.Pattern]
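
A minimal sketch of how such a pattern could be used to split text into special tokens and ordinary tokens. The regular expression below is an illustrative stand-in, not necessarily the exact value of SPLIT_PTTN.

>>> import re
>>> # Illustrative stand-in pattern: special tokens or runs of whitespace.
>>> pttn = re.compile(r'(<bos>|<eos>|<pad>|<unk>|\s+)')
>>> [tk for tk in pttn.split('<bos> a b <eos>') if tk.strip()]
['<bos>', 'a', 'b', '<eos>']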

class lmp.tknzr._ws.WsTknzr(*, is_uncased: bool = False, max_vocab: int = -1, min_count: int = 0, **kwargs: Any)[source]#

Bases: BaseTknzr

Whitespace tokenizer class.

Tokenize text into whitespace-separated tokens. No whitespace is preserved after tokenization.

Parameters
  • is_uncased (bool, default: False) – Set to True to convert text into lowercase. Mainly used by norm.

  • max_vocab (int, default: -1) – Tokenizer’s maximum vocabulary size. Set to -1 to include as many tokens in vocabulary as possible. Mainly used by build_vocab.

  • min_count (int, default: 0) – Minimum token occurrence count. Tokens whose occurrence counts are less than min_count will not be added to the tokenizer’s vocabulary. Mainly used by build_vocab.

  • kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.

id2tk#

Id-to-token lookup table (inverse of tk2id).

Type

dict[int, str]

is_uncased#

Convert text into lowercase if set to True.

Type

bool

max_vocab#

Tokenizer’s maximum vocabulary size.

Type

int

min_count#

Minimum token occurrence counts.

Type

int

tk2id#

Token-to-id lookup table.

Type

dict[str, int]

tknzr_name#

CLI name of the whitespace tokenizer is whitespace.

Type

ClassVar[str]

See also

lmp.tknzr

All available tokenizers.

Examples

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.tknz('a b c') == ['a', 'b', 'c']
>>> assert tknzr.dtknz(['a', 'b', 'c']) == 'a b c'
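
The hyperparameters documented above can be passed at construction time. A brief sketch, assuming the documented behaviour of is_uncased (lowercasing during normalization):

>>> tknzr = WsTknzr(is_uncased=True, max_vocab=10, min_count=2)
>>> assert tknzr.tknz('A B') == ['a', 'b']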
classmethod add_CLI_args(parser: ArgumentParser) None#

Add tokenizer hyperparameters to CLI argument parser.

Parameters

parser (argparse.ArgumentParser) – CLI argument parser.

Return type

None

See also

lmp.script.train_tknzr

Tokenizer training script.

Examples

>>> import argparse
>>> from lmp.tknzr import BaseTknzr
>>> parser = argparse.ArgumentParser()
>>> BaseTknzr.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--max_vocab', '10',
...   '--min_count', '2',
... ])
>>> assert args.is_uncased == False
>>> assert args.max_vocab == 10
>>> assert args.min_count == 2
build_vocab(batch_txt: Iterable[str]) None#

Build tokenizer’s vocabulary.

Build vocabulary based on token occurrence counts. Text in batch_txt is first normalized and tokenized, then each token’s occurrences are counted. Tokens with higher occurrence counts are added to the vocabulary first. Tokens with the same occurrence counts are added to the vocabulary in the order of their appearance.

When adding a new token to the vocabulary, its token id is assigned as the current largest token id + 1. Tokens already in the vocabulary are not added again. If a token’s occurrence count is lower than self.min_count, that token is not added to the vocabulary. If the vocabulary size is larger than or equal to self.max_vocab, no new tokens are added to the vocabulary.

Parameters

batch_txt (collections.abc.Iterable[str]) – Source of text to build vocabulary.

Return type

None

See also

norm

Perform normalization on text.

tknz

Perform tokenization on text.

vocab_size

Tokenizer’s vocabulary size.
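
Examples

A brief sketch of the documented min_count behaviour: tokens occurring fewer than min_count times are left out of the vocabulary.

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr(min_count=2)
>>> tknzr.build_vocab(['a b a', 'a c'])
>>> assert 'a' in tknzr.tk2id
>>> assert 'c' not in tknzr.tk2id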

dec(tkids: List[int], *, rm_sp_tks: bool = False) str#

Decode token id list back to text.

The token id list is first converted into a token list and then detokenized back to text. Special tokens other than <unk> are removed if rm_sp_tks=True. Token ids not in the tokenizer’s inverse lookup table are converted into the <unk> token.

Parameters
  • tkids (list[int]) – Token id list to be decoded.

  • rm_sp_tks (bool, default: False) – Set to True to remove <bos>, <eos> and <pad>.

Returns

Decoded text.

Return type

str

See also

dtknz

Convert tokens back to text.

enc

Encode text into token id list.

Note

Unknown tokens <unk> are not removed even when rm_sp_tks=True, simply because we do not know which token to convert them back to (hence the name unknown token).
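
Examples

A round-trip sketch, assuming the vocabulary has been built so that no token maps to <unk>:

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> tknzr.build_vocab(['a b c'])
>>> tkids = tknzr.enc('a b c')
>>> assert tknzr.dec(tkids, rm_sp_tks=True) == 'a b c'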

dtknz(tks: List[str]) str[source]#

Join tokens with whitespaces.

Insert whitespace between tokens. Returned text is normalized.

Parameters

tks (list[str]) – Token list to be joined.

Returns

Normalized text with whitespaces in between.

Return type

str

See also

tknz

Split text on whitespaces.

norm

Text normalization.

Examples

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.dtknz(['a', 'b', 'c']) == 'a b c'
>>> assert tknzr.dtknz(['abc', 'def']) == 'abc def'
enc(txt: str) List[int]#

Encode text into token id list.

Text will be tokenized into a token list (tk_0, tk_1, ..., tk_n) and formatted as follows:

<bos> tk_0 tk_1 ... tk_n <eos>
  • <bos> is the “begin of sequence” token.

  • <eos> is the “end of sequence” token.

  • <unk> token is used to replace OOV tokens.

All tokens in token list are converted into token ids and returned.

Parameters

txt (str) – Text to be encoded.

Returns

Token id list.

Return type

list[int]

See also

dec

Decode token id list back to text.

pad_to_max

Pad token id list to specified length.

tknz

Perform tokenization on text.
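
Examples

A brief sketch of the documented format: the encoded list wraps the tokens with the <bos> and <eos> ids.

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> tknzr.build_vocab(['a b c'])
>>> tkids = tknzr.enc('a b c')
>>> assert len(tkids) == 5
>>> assert all(isinstance(tkid, int) for tkid in tkids)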

norm(txt: str) str#

Perform text normalization.

Text is normalized with NFKC. Whitespace is collapsed and stripped from both ends. Text is converted into lowercase if is_uncased=True.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Python built-in unicode normalization.

Examples

Convert text to lowercase.

>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr(is_uncased=True)
>>> assert tknzr.norm('ABC') == 'abc'
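
Collapse consecutive whitespace and strip both ends, following the behaviour described above (a brief sketch):

>>> assert tknzr.norm('  a  b  ') == 'a b'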
pad_to_max(max_seq_len: int, tkids: List[int]) List[int]#

Pad token id list to specified length.

If len(tkids) < max_seq_len, padding token ids are appended to the end of tkids until its length equals max_seq_len. Nothing is done when len(tkids) >= max_seq_len.

Parameters
  • max_seq_len (int) – Maximum length constraint.

  • tkids (list[int]) – Token id list to be padded.

Returns

Padded token id list.

Return type

list[int]

Examples

>>> from lmp.vars import PAD_TKID
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr()
>>> assert tknzr.pad_to_max(max_seq_len=4, tkids=[1, 2, 3]) == [1, 2, 3, PAD_TKID]
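
When the list is already at least max_seq_len long, padding is skipped, as described above:

>>> assert tknzr.pad_to_max(max_seq_len=2, tkids=[1, 2, 3]) == [1, 2, 3]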
tknz(txt: str) List[str][source]#

Split text on whitespaces.

Text is first normalized and then split on whitespace.

Parameters

txt (str) – Text to be tokenized.

Returns

List of normalized whitespace-separated tokens.

Return type

list[str]

See also

dtknz

Join tokens with whitespaces.

norm

Text normalization.

Examples

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.tknz('a b c') == ['a', 'b', 'c']
>>> assert tknzr.tknz('abc def') == ['abc', 'def']
property vocab_size: int#

Get tokenizer vocabulary size.

Returns

Tokenizer vocabulary size.

Return type

int

See also

build_vocab

Build vocabulary for tokenizer.
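
Examples

A brief sketch, assuming the tokens in the sample text are not already in the vocabulary: vocabulary size grows after building the vocabulary.

>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> before = tknzr.vocab_size
>>> tknzr.build_vocab(['a b c'])
>>> assert tknzr.vocab_size > before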