lmp.util.dset#

Dataset utilities.

class lmp.util.dset.LMFormatDset(*, dset: BaseDset, max_seq_len: int, stride: int, tknzr: BaseTknzr)[source]#

Convert dataset into language model training format.

Each dataset sample is converted into a token id sequence, which is then split into multiple subsequences. All subsequences have the same length.

Parameters
  • dset (BaseDset) – Dataset to be converted.

  • max_seq_len (int) – Context window size applied to dataset samples.

  • stride (int) – Number of overlapping tokens shared by consecutive context windows. Context windows may overlap; the overlap size is called the stride.

  • tknzr (BaseTknzr) – Tokenizer used to convert dataset samples into token ids.

batch_cur_tkids#

Language model input token ids.

Type

list[torch.Tensor]

batch_is_not_ctx#

Boolean tensors indicating whether each token id is used as conditional context or not. Conditional context tokens are those that overlap with other context windows.

Type

list[torch.Tensor]

batch_next_tkids#

Language model prediction targets.

Type

list[torch.Tensor]

n_tk#

Number of tokens in the dataset. Overlapping tokens are counted only once, and padding tokens are not counted.

Type

int
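The windowing behaviour described above can be illustrated with a minimal sketch. This is not the actual LMFormatDset implementation; it assumes plain Python lists of token ids and a hypothetical padding id, and only shows how windows of length max_seq_len step forward by max_seq_len - stride so that consecutive windows share stride tokens, how overlapping tokens are flagged as conditional context, and how the last window is padded.

```python
PAD_ID = 0  # Hypothetical padding token id.


def split_windows(tkids, max_seq_len, stride):
  """Split a token id sequence into equal-length overlapping windows.

  Returns windows of length ``max_seq_len`` together with boolean masks
  analogous to ``batch_is_not_ctx``: ``True`` marks tokens that are not
  conditional context (i.e. not overlapping a previous window and not
  padding).
  """
  # Window starts advance by the window length minus the overlap.
  step = max_seq_len - stride
  windows, masks = [], []
  for start in range(0, len(tkids), step):
    window = tkids[start:start + max_seq_len]
    # Tokens overlapping the previous window serve only as conditional
    # context; the very first window has no overlap.
    n_ctx = stride if start > 0 else 0
    mask = [i >= n_ctx for i in range(len(window))]
    # Pad the last window to the full length; padding is never counted.
    n_pad = max_seq_len - len(window)
    window = window + [PAD_ID] * n_pad
    mask = mask + [False] * n_pad
    windows.append(window)
    masks.append(mask)
    if start + max_seq_len >= len(tkids):
      break
  return windows, masks


windows, masks = split_windows(list(range(1, 8)), max_seq_len=4, stride=1)
# windows: [[1, 2, 3, 4], [4, 5, 6, 7]]
# masks:   [[True, True, True, True], [False, True, True, True]]
```

Note that summing the masks reproduces the n_tk bookkeeping: each of the seven distinct, non-padding tokens is counted exactly once even though token 4 appears in both windows.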

lmp.util.dset.load(dset_name: str, ver: str, **kwargs: Any) BaseDset[source]#

Load dataset.

Parameters
  • dset_name (str) – Name of the dataset to load.

  • ver (str) – Version of the dataset to load.

Returns

Loaded dataset instance.

Return type

lmp.dset.BaseDset

See also

lmp.dset

All available datasets.

Examples

>>> from lmp.dset import WikiText2Dset
>>> import lmp.util.dset
>>> dset = lmp.util.dset.load(dset_name='wiki-text-2', ver='train')
>>> isinstance(dset, WikiText2Dset)
True
>>> dset.ver == 'train'
True