lmp.util.dset#

Dataset utilities.

class lmp.util.dset.LMFormatDset(*, dset: BaseDset, max_seq_len: int, stride: int, tknzr: BaseTknzr)[source]#

Convert dataset into language model training format.

Each dataset sample is converted into a token id sequence, which is then split into multiple subsequences. All subsequences have the same length.

Parameters
  • dset (BaseDset) – Dataset to be converted.

  • max_seq_len (int) – Context window size applied to dataset samples.

  • stride (int) – Number of overlapping tokens shared by consecutive context windows. Context windows may overlap; the overlap size is called the stride.

  • tknzr (BaseTknzr) – Tokenizer used to convert dataset samples into token ids.

batch_cur_tkids#

Language model input token ids.

Type

list[torch.Tensor]

batch_is_not_ctx#

Boolean tensors indicating whether each token id is used as conditional context or not. Conditional context tokens are those that overlap with other context windows.

Type

list[torch.Tensor]

batch_next_tkids#

Language model prediction targets.

Type

list[torch.Tensor]

n_tk#

Number of tokens in the dataset. Overlapping tokens are counted only once, and padding tokens are not counted.

Type

int
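The windowing behaviour described above can be illustrated with a minimal sketch. This is not the actual LMFormatDset implementation; it assumes plain Python lists of token ids and a hypothetical padding id, and only shows how windows of length max_seq_len step forward by max_seq_len - stride so that consecutive windows share stride tokens, how overlapping tokens are flagged as conditional context, and how the last window is padded.

```python
PAD_ID = 0  # Hypothetical padding token id.


def split_windows(tkids, max_seq_len, stride):
  """Split a token id sequence into equal-length overlapping windows.

  Returns windows of length ``max_seq_len`` together with boolean masks
  analogous to ``batch_is_not_ctx``: ``True`` marks tokens that are not
  conditional context (i.e. not overlapping a previous window and not
  padding).
  """
  # Window starts advance by the window length minus the overlap.
  step = max_seq_len - stride
  windows, masks = [], []
  for start in range(0, len(tkids), step):
    window = tkids[start:start + max_seq_len]
    # Tokens overlapping the previous window serve only as conditional
    # context; the very first window has no overlap.
    n_ctx = stride if start > 0 else 0
    mask = [i >= n_ctx for i in range(len(window))]
    # Pad the last window to the full length; padding is never counted.
    n_pad = max_seq_len - len(window)
    window = window + [PAD_ID] * n_pad
    mask = mask + [False] * n_pad
    windows.append(window)
    masks.append(mask)
    if start + max_seq_len >= len(tkids):
      break
  return windows, masks


windows, masks = split_windows(list(range(1, 8)), max_seq_len=4, stride=1)
# windows: [[1, 2, 3, 4], [4, 5, 6, 7]]
# masks:   [[True, True, True, True], [False, True, True, True]]
```

Note that summing the masks reproduces the n_tk bookkeeping: each of the seven distinct, non-padding tokens is counted exactly once even though token 4 appears in both windows.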

lmp.util.dset.load(dset_name: str, ver: str, **kwargs: Any) BaseDset[source]#

Load dataset.

Parameters
  • dset_name (str) – Name of the dataset to load.

  • ver (str) – Version of the dataset to load.

Returns

Loaded dataset instance.

Return type

lmp.dset.BaseDset

See also

lmp.dset

All available datasets.

Examples

>>> from lmp.dset import WikiText2Dset
>>> import lmp.util.dset
>>> dset = lmp.util.dset.load(dset_name='wiki-text-2', ver='train')
>>> isinstance(dset, WikiText2Dset)
True
>>> dset.ver == 'train'
True