Whitespace Tokenizer#
- class lmp.tknzr.WsTknzr(*, is_uncased: bool = False, max_vocab: int = -1, min_count: int = 0, **kwargs: Any)[source]#
Bases:
BaseTknzr
Whitespace tokenizer class.
Tokenize text into whitespace-separated tokens. No whitespace is preserved after tokenization.
- Parameters
is_uncased (bool, default: False) – Set to True to convert text into lowercase. Mainly used by norm.
max_vocab (int, default: -1) – Tokenizer's maximum vocabulary size. Set to -1 to include as many tokens in the vocabulary as possible. Mainly used by build_vocab.
min_count (int, default: 0) – Minimum token occurrence count. Tokens whose occurrence counts are less than min_count will not be added to the tokenizer's vocabulary. Mainly used by build_vocab.
kwargs (Any, optional) – Unused parameter. Intentionally left for subclass inheritance.
See also
- lmp.tknzr
All available tokenizers.
Examples
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.tknz('a b c') == ['a', 'b', 'c']
>>> assert tknzr.dtknz(['a', 'b', 'c']) == 'a b c'
- classmethod add_CLI_args(parser: ArgumentParser) None #
Add tokenizer hyperparameters to CLI argument parser.
- Parameters
parser (argparse.ArgumentParser) – CLI argument parser.
- Return type
None
See also
- lmp.script.train_tknzr
Tokenizer training script.
Examples
>>> import argparse
>>> from lmp.tknzr import BaseTknzr
>>> parser = argparse.ArgumentParser()
>>> BaseTknzr.add_CLI_args(parser)
>>> args = parser.parse_args([
...   '--max_vocab', '10',
...   '--min_count', '2',
... ])
>>> assert args.is_uncased == False
>>> assert args.max_vocab == 10
>>> assert args.min_count == 2
- build_vocab(batch_txt: Iterable[str]) None #
Build tokenizer’s vocabulary.
Build vocabulary based on token occurrence counts. Text in batch_txt is first normalized and tokenized, then each token's occurrences are counted. Tokens with higher occurrence counts are added to the vocabulary first; tokens with the same occurrence count are added in order of appearance.
When a new token is added to the vocabulary, its token id is assigned the current largest token id + 1. Tokens already in the vocabulary are not added again. If a token's occurrence count is lower than self.min_count, that token is not added. If the vocabulary size is larger than or equal to self.max_vocab, no new tokens are added.
- Parameters
batch_txt (collections.abc.Iterable[str]) – Source of text to build vocabulary.
- Return type
None
See also
norm
Perform normalization on text.
tknz
Perform tokenization on text.
vocab_size
Tokenizer’s vocabulary size.
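A minimal usage sketch follows (the corpus is hypothetical; exact vocabulary contents also depend on the tokenizer's special tokens):
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr(min_count=2)
>>> tknzr.build_vocab(batch_txt=['a b c', 'a b'])
>>> # 'a' and 'b' occur twice, so both are added to the vocabulary.
>>> # 'c' occurs once, which is less than min_count, so it is skipped.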
- dec(tkids: List[int], *, rm_sp_tks: bool = False) str #
Decode token id list back to text.
The token id list is first converted into a token list, then detokenized back to text. Special tokens other than <unk> are removed when rm_sp_tks=True. Token ids not in the tokenizer's inverse lookup table are converted into the <unk> token.
- Parameters
tkids (list[int]) – Token id list to be decoded.
rm_sp_tks (bool, default: False) – Set to True to remove special tokens other than <unk>.
- Returns
Decoded text.
- Return type
str
Note
Unknown tokens <unk> will not be removed even when rm_sp_tks=True. This is simply because we do not know which token to convert them back to (hence the name "unknown token").
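A hedged round-trip sketch (assuming every token in the input text was added to the vocabulary by build_vocab; the exact token id values are an implementation detail):
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> tknzr.build_vocab(batch_txt=['a b c'])
>>> tkids = tknzr.enc('a b c')  # includes <bos> and <eos> ids
>>> assert tknzr.dec(tkids, rm_sp_tks=True) == 'a b c'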
- dtknz(tks: List[str]) str [source]#
Join tokens with whitespaces.
Whitespace is inserted between tokens. The returned text is normalized.
- Parameters
tks (list[str]) – Token list to be detokenized.
- Returns
Normalized text with whitespace in between.
- Return type
str
Examples
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.dtknz(['a', 'b', 'c']) == 'a b c'
>>> assert tknzr.dtknz(['abc', 'def']) == 'abc def'
- enc(txt: str) List[int] #
Encode text into token id list.
Text is tokenized into a token list (tk_0, tk_1, ..., tk_n) and formatted as follows:
<bos> tk_0 tk_1 ... tk_n <eos>
<bos> is the "begin of sequence" token, <eos> is the "end of sequence" token, and the <unk> token is used to replace OOV tokens.
All tokens in the token list are then converted into token ids and returned.
See also
dec
Decode token id list back to text.
pad_to_max
Pad token id list to specified length.
tknz
Perform tokenization on text.
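A hedged sketch of the result's shape (the concrete token id values are not asserted, since they depend on how the vocabulary was built):
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> tknzr.build_vocab(batch_txt=['a b'])
>>> tkids = tknzr.enc('a b')
>>> # tkids looks like [<bos id>, id('a'), id('b'), <eos id>].
>>> assert len(tkids) == len(tknzr.tknz('a b')) + 2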
- norm(txt: str) str #
Perform text normalization.
Text is normalized with NFKC. Whitespace is collapsed and stripped from both ends. Text is converted into lowercase when is_uncased=True.
See also
unicodedata.normalize
Python built-in unicode normalization.
Examples
Convert text to lowercase.
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr(is_uncased=True)
>>> assert tknzr.norm('ABC') == 'abc'
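Whitespace collapsing follows directly from the rules above (a sketch; WsTknzr shares the same normalization):
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.norm('  a  b  ') == 'a b'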
- pad_to_max(max_seq_len: int, tkids: List[int]) List[int] #
Pad token id list to specified length.
If len(tkids) < max_seq_len, padding token ids are appended to the end of tkids until its length equals max_seq_len. Nothing is done when len(tkids) >= max_seq_len.
- Parameters
max_seq_len (int) – Maximum length of the padded token id list.
tkids (list[int]) – Token id list to be padded.
- Returns
Padded token id list.
- Return type
list[int]
Examples
>>> from lmp.vars import PAD_TKID
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr()
>>> assert tknzr.pad_to_max(max_seq_len=4, tkids=[1, 2, 3]) == [1, 2, 3, PAD_TKID]
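When the list is already long enough, the input is returned without padding (a sketch of the "do nothing" rule above):
>>> from lmp.tknzr import CharTknzr
>>> tknzr = CharTknzr()
>>> assert tknzr.pad_to_max(max_seq_len=2, tkids=[1, 2, 3]) == [1, 2, 3]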
- tknz(txt: str) List[str] [source]#
Split text on whitespaces.
Text is first normalized, then split on whitespace.
- Parameters
txt (str) – Text to be tokenized.
- Returns
List of normalized whitespace-separated tokens.
- Return type
list[str]
Examples
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> assert tknzr.tknz('a b c') == ['a', 'b', 'c']
>>> assert tknzr.tknz('abc def') == ['abc', 'def']
- property vocab_size: int#
Get tokenizer vocabulary size.
- Returns
Tokenizer vocabulary size.
- Return type
int
See also
build_vocab
Build vocabulary for tokenizer.
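A minimal sketch (assuming a freshly constructed tokenizer's vocabulary holds only special tokens; under default min_count=0 and max_vocab=-1, all three tokens below are added):
>>> from lmp.tknzr import WsTknzr
>>> tknzr = WsTknzr()
>>> before = tknzr.vocab_size  # assumed: special tokens only
>>> tknzr.build_vocab(batch_txt=['a b c'])
>>> assert tknzr.vocab_size == before + 3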