lmp.tknzr._bpe
Byte-Pair Encoding tokenizer class.
- class lmp.tknzr._bpe.BPETknzr(*, is_uncased: bool = False, max_vocab: int = -1, min_count: int = 0, n_merge: int = 10000, **kwargs: Any)[source]#
Bases:
BaseTknzr
Byte-Pair Encoding [1] tokenizer class.
Tokenize text into a list of subwords. When max_vocab is set to -1, this tokenizer will contain every unicode character and every whitespace-separated token in its vocabulary.
- Parameters
is_uncased (bool, default: False) – Set to True to convert text into lowercase. Mainly used by norm.
max_vocab (int, default: -1) – Tokenizer’s maximum vocabulary size. Set to -1 to include as many tokens in the vocabulary as possible. Mainly used by build_vocab.
min_count (int, default: 0) – Minimum token occurrence count. Subwords whose occurrence counts are less than min_count will not be added to the tokenizer’s vocabulary. Mainly used by build_vocab.
n_merge (int, default: 10000) – Maximum number of merging operations to perform.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
See also
- lmp.tknzr
All available tokenizers.
Examples
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> assert tknzr.tknz('abc def') == ['abc<eow>', 'def<eow>']
>>> assert tknzr.dtknz(['abc<eow>', 'def<eow>']) == 'abc def'
- classmethod add_CLI_args(parser: ArgumentParser) None [source]#
Add tokenizer hyperparameters to CLI argument parser.
- Parameters
parser (argparse.ArgumentParser) – CLI argument parser.
- Return type
None
See also
- lmp.script.train_tknzr
Tokenizer training script.
Examples
>>> import argparse
>>> from lmp.tknzr import BPETknzr
>>> parser = argparse.ArgumentParser()
>>> BPETknzr.add_CLI_args(parser)
>>> args = parser.parse_args([
... '--max_vocab', '10',
... '--min_count', '2',
... '--n_merge', '5000',
... ])
>>> assert args.is_uncased == False
>>> assert args.max_vocab == 10
>>> assert args.min_count == 2
>>> assert args.n_merge == 5000
- build_vocab(batch_txt: Iterable[str]) None [source]#
Build tokenizer’s vocabulary.
Build vocabulary based on subword occurrence counts. Text in batch_txt is first normalized and split into unicode characters. All unicode characters with occurrence counts higher than self.min_count are included in the vocabulary. After adding unicode characters to the vocabulary, we treat each unicode character as a subword and merge subword pairs whose co-occurrence counts are higher than self.min_count into new subwords. The merging operation is performed at most self.n_merge times. After merging stops, subwords with high occurrence counts are added to the vocabulary. A sketch of the merging step appears after this entry.
- Parameters
batch_txt (collections.abc.Iterable[str]) – Source of text to build vocabulary.
- Return type
None
See also
norm
Perform normalization on text.
vocab_size
Tokenizer’s vocabulary size.
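The merging step described above can be illustrated with a small, self-contained sketch. The merge_once helper below is hypothetical and not part of lmp; it only shows how adjacent subword pair co-occurrences could be counted and how the most frequent pair could be merged, under the stop rule stated above (merge only while the best pair occurs more than min_count times).

import collections
from typing import Dict, List, Tuple

def merge_once(batch_subwords: List[List[str]], min_count: int) -> List[List[str]]:
    """Perform one BPE-style merge over already-split subword lists (illustrative only)."""
    # Count co-occurrences of adjacent subword pairs over the whole batch.
    pair_counts: Dict[Tuple[str, str], int] = collections.Counter()
    for subwords in batch_subwords:
        for pair in zip(subwords, subwords[1:]):
            pair_counts[pair] += 1

    if not pair_counts:
        return batch_subwords

    best_pair, best_count = max(pair_counts.items(), key=lambda item: item[1])
    # Stop merging when the most frequent pair does not occur more than min_count times.
    if best_count <= min_count:
        return batch_subwords

    # Replace every occurrence of the best pair with the merged subword.
    merged: List[List[str]] = []
    for subwords in batch_subwords:
        out: List[str] = []
        idx = 0
        while idx < len(subwords):
            if idx + 1 < len(subwords) and (subwords[idx], subwords[idx + 1]) == best_pair:
                out.append(subwords[idx] + subwords[idx + 1])
                idx += 2
            else:
                out.append(subwords[idx])
                idx += 1
        merged.append(out)
    return merged

>>> merge_once([['l', 'o', 'w'], ['l', 'o', 'w', 'e', 'r'], ['n', 'o', 'w']], min_count=1)
[['l', 'ow'], ['l', 'ow', 'e', 'r'], ['n', 'ow']]

Repeating such a step up to n_merge times, then adding frequent subwords to the vocabulary, corresponds to the procedure described above.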
- dec(tkids: List[int], *, rm_sp_tks: bool = False) str #
Decode token id list back to text.
Token id list is first converted into a token list and then detokenized back to text. Special tokens other than <unk> will be removed when rm_sp_tks=True is set. Token ids not in the tokenizer’s inverse lookup table are converted into the <unk> token.
- Parameters
tkids (list[int]) – Token id list to be decoded.
rm_sp_tks (bool, default: False) – Set to True to remove special tokens other than <unk>.
- Returns
Decoded text.
- Return type
str
Note
Unknown tokens <unk> will not be removed even when rm_sp_tks=True is set. This is simply because we do not know which token to convert them back to (hence the name unknown token).
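Examples
A minimal roundtrip sketch. A fresh tokenizer has not been fitted with build_vocab, so the exact decoded text depends on the vocabulary; only the output type is checked here.
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> tkids = tknzr.enc('abc def')
>>> assert isinstance(tknzr.dec(tkids, rm_sp_tks=True), str)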
- dtknz(tks: List[str]) str [source]#
Convert list of words and subwords back to text.
First, subwords are joined into words without whitespace and end-of-word markers (<eow>) are removed. Then words are joined with whitespace. The returned text is normalized.
- Parameters
tks (list[str]) – Token list to be detokenized.
- Returns
Normalized text with whitespace in between.
- Return type
str
Examples
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> assert tknzr.dtknz(['abc<eow>', 'def<eow>']) == 'abc def'
- enc(txt: str) List[int] #
Encode text into token id list.
Text will be tokenized into a token list (tk_0, tk_1, ..., tk_n) and formatted as follows:
<bos> tk_0 tk_1 ... tk_n <eos>
<bos> is the “begin of sequence” token.
<eos> is the “end of sequence” token.
<unk> is used to replace out-of-vocabulary (OOV) tokens.
All tokens in the token list are converted into token ids and returned.
See also
dec
Decode token id list back to text.
pad_to_max
Pad token id list to specified length.
tknz
Perform tokenization on text.
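Examples
A minimal usage sketch. The exact token ids depend on the tokenizer’s vocabulary, so only the output type is asserted here.
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> tkids = tknzr.enc('abc def')
>>> assert all(isinstance(tkid, int) for tkid in tkids)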
- norm(txt: str) str #
Perform text normalization.
Text is normalized by NFKC. Whitespace is collapsed and stripped from both ends. Text is converted into lowercase when is_uncased=True is set.
See also
unicodedata.normalize
Python built-in unicode normalization.
Examples
Convert text to lowercase.
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr(is_uncased=True)
>>> assert tknzr.norm('ABC') == 'abc'
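Collapse and strip whitespace. This example is assumed from the whitespace rule stated above rather than taken from the source.
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> assert tknzr.norm('  abc   def  ') == 'abc def'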
- pad_to_max(max_seq_len: int, tkids: List[int]) List[int] #
Pad token id list to specified length.
If len(tkids) < max_seq_len, then padding token ids are appended to the end of tkids until tkids has length equal to max_seq_len. Nothing is done when len(tkids) >= max_seq_len.
- Parameters
max_seq_len (int) – Maximum length of the padded token id list.
tkids (list[int]) – Token id list to be padded.
- Returns
Padded token id list.
- Return type
list[int]
Examples
>>> from lmp.vars import PAD_TKID
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr()
>>> assert tknzr.pad_to_max(max_seq_len=4, tkids=[1, 2, 3]) == [1, 2, 3, PAD_TKID]
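The “do nothing” branch described above implies that a list which is already long enough is returned unchanged; this continuation of the example is assumed from that rule rather than taken from the source.
>>> assert tknzr.pad_to_max(max_seq_len=2, tkids=[1, 2, 3]) == [1, 2, 3]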
- tknz(txt: str) List[str] [source]#
Convert text into list of words and subwords.
Text is first normalized and then split by whitespace. Each whitespace-separated token is then converted into a word or a list of subwords based on whether that token is in the vocabulary or not. Each special token is treated as a unit and thus is not split.
- Parameters
txt (str) – Text to be tokenized.
- Returns
List of words and subwords.
- Return type
list[str]
Examples
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> assert tknzr.tknz('abc def') == ['abc<eow>', 'def<eow>']
- property vocab_size: int#
Get tokenizer vocabulary size.
- Returns
Tokenizer vocabulary size.
- Return type
int
See also
build_vocab
Build vocabulary for tokenizer.
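Examples
A minimal sketch, assuming a freshly constructed tokenizer reports an integer vocabulary size.
>>> from lmp.tknzr import BPETknzr
>>> tknzr = BPETknzr()
>>> assert isinstance(tknzr.vocab_size, int)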
[1] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. Berlin, Germany, August 2016. Association for Computational Linguistics. URL: https://aclanthology.org/P16-1162, doi: 10.18653/v1/P16-1162.