lmp.util.dset
Dataset utilities.
- class lmp.util.dset.LMFormatDset(*, dset: BaseDset, max_seq_len: int, stride: int, tknzr: BaseTknzr)
Convert dataset into language model training format.
Each dataset sample is converted into a token id sequence. Each token id sequence is then split into multiple subsequences, all of the same length.
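The splitting step can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `split_into_windows`, the `pad_id` argument, and the choice to pad the final window are all assumptions made for illustration. It only shows one way that `max_seq_len` and `stride` could yield equal-length input/target pairs, with targets being the inputs shifted by one position.

```python
from typing import List, Tuple


def split_into_windows(
    tkids: List[int],
    max_seq_len: int,
    stride: int,
    pad_id: int = 0,  # hypothetical padding id, not taken from the library
) -> List[Tuple[List[int], List[int]]]:
    """Split one token id sequence into equal-length (input, target) pairs."""
    if max_seq_len < 1 or stride < 1:
        raise ValueError('max_seq_len and stride must be positive.')

    windows = []
    # Each window needs max_seq_len + 1 token ids: the extra one lets the
    # target sequence be the input sequence shifted left by one position.
    for start in range(0, len(tkids) - 1, stride):
        chunk = tkids[start:start + max_seq_len + 1]
        # Pad the last window so every subsequence has the same length.
        chunk = chunk + [pad_id] * (max_seq_len + 1 - len(chunk))
        windows.append((chunk[:-1], chunk[1:]))
        # Stop once a window has reached the end of the sequence.
        if start + max_seq_len + 1 >= len(tkids):
            break
    return windows


windows = split_into_windows([1, 2, 3, 4, 5, 6, 7], max_seq_len=4, stride=2)
for cur_tkids, next_tkids in windows:
    print(cur_tkids, next_tkids)
```

With a stride smaller than `max_seq_len`, consecutive windows overlap, so some token ids appear in more than one subsequence.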
- Parameters
- batch_cur_tkids
Language model input token ids.
- Type
- batch_is_not_ctx
Boolean tensor indicating whether token ids are used as conditional context or not. Conditional context refers to tokens that overlap with other context windows.
- Type
- batch_next_tkids
Language model prediction target.
- Type
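As a rough illustration of the context mask described above, the sketch below marks which positions in a window are new and which overlap with the preceding window. This is one plausible reading of the description, not the library's code: the helper `not_ctx_mask` is hypothetical, it assumes windows start at multiples of `stride`, it treats only overlap with the previous window as context, and it ignores padding.

```python
from typing import List


def not_ctx_mask(window_idx: int, max_seq_len: int, stride: int) -> List[bool]:
    """Return True for positions that are NOT conditional context.

    Assumption: a position counts as context when the previous window
    already covered it, which happens for the first
    max_seq_len - stride positions of every window after the first.
    """
    if window_idx == 0:
        # Nothing precedes the first window, so no position is context.
        return [True] * max_seq_len
    n_overlap = max(max_seq_len - stride, 0)
    return [False] * n_overlap + [True] * (max_seq_len - n_overlap)


print(not_ctx_mask(0, max_seq_len=4, stride=2))
print(not_ctx_mask(1, max_seq_len=4, stride=2))
```

Under this reading, a mask like this could be used to skip already-seen positions when computing a loss or metric, so that overlapping tokens are counted only once.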
- lmp.util.dset.load(dset_name: str, ver: str, **kwargs: Any) -> BaseDset
Load dataset.
- Parameters
- Returns
Loaded dataset instance.
- Return type
See also
- lmp.dset
All available datasets.
Examples
>>> from lmp.dset import WikiText2Dset
>>> import lmp.util.dset
>>> dset = lmp.util.dset.load(dset_name='wiki-text-2', ver='train')
>>> isinstance(dset, WikiText2Dset)
True
>>> dset.ver == 'train'
True