Wiki-Text-2 Dataset

class lmp.dset.WikiText2Dset(*, ver: Optional[str] = None)

Bases: BaseDset

Wiki-Text-2 dataset.

Wiki-Text-2 [1] is part of the WikiText Long Term Dependency Language Modeling Dataset. See Wiki-Text for more details.

The statistics of each supported version are listed below. Token counts are obtained by splitting each sample on whitespace.

Version   Number of samples   Maximum number of tokens   Minimum number of tokens
test      60                  14299                      461
train     600                 17706                      281
valid     60                  18855                      778
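As a rough illustration of how such statistics can be computed, the sketch below counts whitespace-separated tokens per sample. The helper name token_stats and the toy samples are hypothetical and are not part of lmp; with the real dataset one would presumably pass dset.spls instead.

```python
from typing import List, Tuple


def token_stats(spls: List[str]) -> Tuple[int, int, int]:
  """Return (number of samples, maximum tokens, minimum tokens)."""
  # str.split() without arguments splits on runs of whitespace.
  counts = [len(spl.split()) for spl in spls]
  return len(spls), max(counts), min(counts)


# Toy samples standing in for the samples of a real dataset version.
toy_spls = ['a b c', 'hello world', 'one two three four']
print(token_stats(toy_spls))  # (3, 4, 2)
```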

Parameters

ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version is 'train'.

Type

ClassVar[str]

dset_name

CLI name of Wiki-Text-2 dataset is wiki-text-2.

Type

ClassVar[str]

spls

All samples in the dataset.

Type

list[str]

ver

Version of the dataset.

Type

str

vers

Supported versions including 'train', 'test' and 'valid'.

Type

ClassVar[list[str]]

Examples

>>> from lmp.dset import WikiText2Dset
>>> dset = WikiText2Dset(ver='test')
>>> dset[0][:31]
'Robert <unk> is an English film'

__getitem__(idx: int) -> str

Get a sample by its index.

Parameters

idx (int) – Sample index.

Returns

The sample whose index equals idx.

Return type

str

__iter__() -> Iterator[str]

Iterate through each sample in the dataset.

Yields

str – One sample in self.spls, ordered by sample indices.

__len__() -> int

Get dataset size.

Returns

Number of samples in the dataset.

Return type

int
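The indexing, iteration and length behaviour described above can be sketched with a small stand-in class. ToyDset is hypothetical and is not lmp's implementation: it fills spls with made-up strings instead of reading raw files, and raising ValueError for an unsupported version is an assumption.

```python
from typing import ClassVar, Iterator, List, Optional


class ToyDset:
  """Hypothetical stand-in that mimics the documented interface."""

  df_ver: ClassVar[str] = 'train'
  vers: ClassVar[List[str]] = ['test', 'train', 'valid']

  def __init__(self, *, ver: Optional[str] = None):
    # Fall back to the class default version when ver is None.
    if ver is None:
      ver = self.__class__.df_ver
    # Assumption: unsupported versions are rejected with ValueError.
    if ver not in self.__class__.vers:
      raise ValueError(f'Unsupported version: {ver}')
    self.ver = ver
    # Made-up samples; a real dataset would load them from disk.
    self.spls: List[str] = [f'{ver} sample {i}' for i in range(3)]

  def __getitem__(self, idx: int) -> str:
    return self.spls[idx]

  def __iter__(self) -> Iterator[str]:
    # Yield samples in index order.
    yield from self.spls

  def __len__(self) -> int:
    return len(self.spls)


dset = ToyDset()
print(dset.ver, len(dset), dset[0])  # train 3 train sample 0
print(list(dset) == dset.spls)       # True
```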

classmethod download_dataset() -> None

Download Wiki-Text-2 dataset.

Download the zip file from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip and extract the raw files from it. Raw files are named wiki.ver.tokens, where ver is the version of the dataset. After the raw files are extracted, the downloaded zip file is deleted.

Return type

None
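The extraction step can be sketched with the standard zipfile module. This is only an illustration under stated assumptions, not lmp's actual code: extract_raw_files is a hypothetical helper, the archive layout (a wikitext-2/ folder) is assumed, and the download and zip-deletion steps are omitted. The demo builds a tiny archive in memory so that no network access is needed.

```python
import io
import os
import tempfile
import zipfile
from typing import List


def extract_raw_files(zip_bytes: bytes, out_dir: str) -> List[str]:
  """Extract members named wiki.<ver>.tokens into out_dir."""
  extracted = []
  with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    for info in zf.infolist():
      # Drop any folder prefix stored inside the archive.
      name = os.path.basename(info.filename)
      if name.startswith('wiki.') and name.endswith('.tokens'):
        with zf.open(info) as src, open(os.path.join(out_dir, name), 'wb') as dst:
          dst.write(src.read())
        extracted.append(name)
  return sorted(extracted)


# Build a toy archive in memory (assumed layout, made-up content).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
  zf.writestr('wikitext-2/wiki.test.tokens', 'toy test text')
  zf.writestr('wikitext-2/readme.txt', 'ignored')

with tempfile.TemporaryDirectory() as tmp:
  print(extract_raw_files(buf.getvalue(), tmp))  # ['wiki.test.tokens']
```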

static download_file(mode: str, download_path: str, url: str) -> None

Download file from url.

Parameters
  • mode (str) – Can only be 'binary' or 'text'.

  • download_path (str) – File path of the downloaded file.

  • url (str) – URL of the file to be downloaded.

Return type

None

static norm(txt: str) -> str

Text normalization.

Text is NFKC normalized. Consecutive whitespace is collapsed and whitespace is stripped from both ends.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Python built-in unicode normalization.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('123456789')
'123456789'
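A minimal sketch of the normalization described above, assuming it amounts to unicodedata.normalize followed by whitespace collapsing and stripping. norm_sketch is a hypothetical name and the exact steps inside BaseDset.norm may differ. The input uses full-width characters to make the effect of NFKC visible.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
  """NFKC normalize, collapse whitespace runs, strip both ends."""
  # NFKC maps compatibility characters (e.g. full-width forms) to
  # their canonical equivalents.
  txt = unicodedata.normalize('NFKC', txt)
  # Collapse each run of whitespace into one space, then strip.
  return re.sub(r'\s+', ' ', txt).strip()


# Full-width letters, an ideographic space and full-width digits.
print(norm_sketch('  ＡＢＣ　１２３ \n'))  # ABC 123
```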

[1] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations. 2017. URL: https://openreview.net/forum?id=Byj72udxe.