Wiki-Text-2 Dataset

class lmp.dset.WikiText2Dset(*, ver: Optional[str] = None)

Wiki-Text-2 dataset.

Wiki-Text-2 [1] is part of the WikiText Long Term Dependency Language Modeling Dataset. See Wiki-Text for more details.

The statistics of each supported version are listed below. Tokens are separated by whitespace.

Version	Number of samples	Maximum number of tokens	Minimum number of tokens
`test`	60	14299	461
`train`	600	17706	281
`valid`	60	18855	778
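
The statistics in the table above come from splitting each sample on whitespace and counting the resulting tokens. The following is a minimal sketch of that computation on a plain list of strings. The helper name `token_stats` and the toy samples are illustrative assumptions and are not part of `lmp`.

```python
from typing import Dict, Iterable


def token_stats(samples: Iterable[str]) -> Dict[str, int]:
    """Count samples and find the max / min whitespace-separated token counts.

    Hypothetical helper for illustration only; it is not part of lmp.
    """
    # str.split() with no arguments splits on any run of whitespace.
    counts = [len(sample.split()) for sample in samples]
    if not counts:
        raise ValueError('samples must not be empty')
    return {
        'num_samples': len(counts),
        'max_tokens': max(counts),
        'min_tokens': min(counts),
    }


# Toy samples standing in for real dataset samples.
toy_samples = ['Robert <unk> is an English film', 'a b  c', 'hello']
stats = token_stats(toy_samples)
print(stats)
```

Because the dataset is documented as iterable over `str` samples, passing a `WikiText2Dset` instance in place of `toy_samples` should work the same way.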

Parameters: ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

The default version is 'train'.

dset_name

The CLI name of the Wiki-Text-2 dataset is wiki-text-2.

spls

All samples in the dataset.

ver

Version of the dataset.

vers

Supported versions: 'train', 'test', and 'valid'.

Examples

>>> from lmp.dset import WikiText2Dset
>>> dset = WikiText2Dset(ver='test')
>>> dset[0][:31]
'Robert <unk> is an English film'

__getitem__(idx: int) → str

Get the sample text at the given index.

__iter__() → Iterator[str]

Iterate through each sample in the dataset.

__len__() → int

Get dataset size.
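
To illustrate how `__getitem__`, `__iter__` and `__len__` fit together with the `ver`, `df_ver`, `vers` and `spls` attributes, here is a minimal stand-in class. `ToyDset`, its placeholder samples, and the choice to raise `ValueError` for an unknown version are assumptions made for this sketch. It is not the `lmp` implementation.

```python
from typing import Iterator, List, Optional


class ToyDset:
    """Hypothetical stand-in mirroring the documented interface; not lmp code."""

    df_ver = 'train'
    vers = ['test', 'train', 'valid']

    def __init__(self, *, ver: Optional[str] = None):
        # None falls back to the class-level default version.
        if ver is None:
            ver = self.__class__.df_ver
        # Assumption: unknown versions are rejected with ValueError.
        if ver not in self.__class__.vers:
            raise ValueError(f'unsupported version: {ver}')
        self.ver = ver
        # Placeholder samples; a real dataset would load them from raw files.
        self.spls: List[str] = [f'{ver} sample {i}' for i in range(3)]

    def __getitem__(self, idx: int) -> str:
        # Return the sample text at the given index.
        return self.spls[idx]

    def __iter__(self) -> Iterator[str]:
        # Iterate through each sample in order.
        return iter(self.spls)

    def __len__(self) -> int:
        # The dataset size is the number of samples.
        return len(self.spls)


dset = ToyDset()
print(dset.ver, len(dset), dset[0])
```

Delegating all three methods to the `spls` list keeps indexing, iteration and size consistent with one another.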

classmethod download_dataset() → None

Download the Wiki-Text-2 dataset.

Download the zip file from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip and extract the raw files from it. Raw files are named wiki.ver.tokens, where ver is the version of the dataset. After the raw files are extracted, the downloaded zip file is deleted.
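
The download, extract and delete sequence described above could be sketched with the standard library as follows. The function names, the `download.zip` file name, and the rule of matching members by base name are assumptions made for illustration. This is not how `lmp` necessarily implements `download_dataset`. The demo at the end builds a small local archive so that the extraction step can be tried without network access.

```python
import os
import re
import tempfile
import urllib.request
import zipfile
from typing import List

# Raw file names of the form wiki.<ver>.tokens.
RAW_FILE_PATTERN = re.compile(r'^wiki\.(\w+)\.tokens$')


def extract_raw_files(zip_path: str, out_dir: str) -> List[str]:
    """Extract wiki.<ver>.tokens files from a zip archive, then delete the archive."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, 'r') as zf:
        for member in zf.namelist():
            name = os.path.basename(member)
            if RAW_FILE_PATTERN.match(name):
                # Write each raw file directly into out_dir, dropping any folder prefix.
                out_path = os.path.join(out_dir, name)
                with zf.open(member) as src, open(out_path, 'wb') as dst:
                    dst.write(src.read())
                extracted.append(out_path)
    # The archive is no longer needed once the raw files are extracted.
    os.remove(zip_path)
    return sorted(extracted)


def download_and_extract(url: str, out_dir: str) -> List[str]:
    """Download a zip archive from url and extract its raw files into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, 'download.zip')  # hypothetical file name
    urllib.request.urlretrieve(url, zip_path)
    return extract_raw_files(zip_path, out_dir)


# Offline demo: build a tiny archive and run only the extraction step.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('folder/wiki.train.tokens', 'hello world')
        zf.writestr('folder/readme.txt', 'not a raw file')
    demo_paths = extract_raw_files(demo_zip, os.path.join(tmp_dir, 'raw'))
    demo_names = [os.path.basename(p) for p in demo_paths]
    demo_zip_deleted = not os.path.exists(demo_zip)

print(demo_names, demo_zip_deleted)
```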

static download_file(mode: str, download_path: str, url: str) → None

Download file from url.

Parameters: mode (str), download_path (str), url (str)

Return type

None

static norm(txt: str) → str

Text normalization.

Text is NFKC-normalized. Consecutive whitespace is collapsed, and leading and trailing whitespace is stripped.
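
A minimal sketch of this kind of normalization, using only the standard library, is shown below. The function name `norm_sketch` is hypothetical, and collapsing each whitespace run into a single space is an assumption about what "collapsed" means here. The actual `norm` implementation may differ in its details.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """Hypothetical re-creation of the described normalization; not lmp code."""
    # NFKC folds compatibility characters, e.g. full-width letters, into canonical forms.
    txt = unicodedata.normalize('NFKC', txt)
    # Assumption: each run of whitespace becomes one space, then both ends are stripped.
    return re.sub(r'\s+', ' ', txt).strip()


# '\uff28' is a full-width 'H' and '\u3000' is an ideographic space.
normalized = norm_sketch('  \uff28ello\u3000 world \n')
print(normalized)
```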