Wiki-Text-2 Dataset#
- class lmp.dset.WikiText2Dset(*, ver: Optional[str] = None)[source]#
Bases: BaseDset
Wiki-Text-2 dataset.
Wiki-Text-2 [1] is part of the WikiText Long Term Dependency Language Modeling Dataset. See Wiki-Text for more details.
Here are the statistics of each supported version. Tokens are separated by whitespaces.
Version | Number of samples | Maximum number of tokens | Minimum number of tokens
------- | ----------------- | ------------------------ | ------------------------
test    | 60                | 14299                    | 461
train   | 600               | 17706                    | 281
valid   | 60                | 18855                    | 778
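Since tokens are whitespace-separated, per-sample token counts like those in the table can be computed with a plain `split()`. A minimal sketch over a hypothetical pair of samples (real statistics require the downloaded dataset):

```python
# Token-count statistics, assuming tokens are separated by whitespace
# as stated above.  These two samples are hypothetical placeholders.
samples = [
    'Robert <unk> is an English film',
    'Homarus gammarus , known as the European lobster',
]

# Number of whitespace-separated tokens in each sample.
token_counts = [len(spl.split()) for spl in samples]

print(max(token_counts), min(token_counts))
```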
- Parameters
ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.
Examples
>>> from lmp.dset import WikiText2Dset
>>> dset = WikiText2Dset(ver='test')
>>> dset[0][:31]
'Robert <unk> is an English film'
- __iter__() → Iterator[str]#
Iterate through each sample in the dataset.
- Yields
str – One sample in self.spls, ordered by sample indices.
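A minimal sketch of how such an iterator can be implemented, assuming only that the dataset stores its samples in a list attribute spls (the attribute name comes from this docstring; the surrounding class here is hypothetical):

```python
from typing import Iterator, List


class TinyDset:
    """Hypothetical stand-in for a dataset holding samples in self.spls."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __iter__(self) -> Iterator[str]:
        # Yield one sample at a time, ordered by sample indices.
        for spl in self.spls:
            yield spl


dset = TinyDset(['first sample', 'second sample'])
print(list(dset))  # samples come back in index order
```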
- classmethod download_dataset() → None[source]#
Download Wiki-Text-2 dataset.
Download the zip file from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip and extract the raw files from it. Raw files are named wiki.ver.tokens, where ver is the version of the dataset. After the raw files are extracted, the downloaded zip file is deleted.
- Return type
None
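The download-extract-cleanup steps described above can be sketched with the standard library; the URL comes from the docstring, but the function name and output path are assumptions, and the real method may differ:

```python
import os
import urllib.request
import zipfile


def download_and_extract(url: str, out_dir: str) -> None:
    """Download a zip archive, extract its members into out_dir, then delete it."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, 'dataset.zip')

    # Download the archive (e.g. wikitext-2-v1.zip).
    urllib.request.urlretrieve(url, zip_path)

    # Extract raw files such as wiki.train.tokens from the archive.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)

    # The zip file is deleted after extraction.
    os.remove(zip_path)
```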
- static norm(txt: str) → str#
Text normalization.
Text will be NFKC normalized. Whitespace is collapsed and stripped from both ends.
See also
unicodedata.normalize
Python built-in unicode normalization.
Examples
>>> from lmp.dset import BaseDset
>>> BaseDset.norm('１２３４５６７８９')
'123456789'
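The normalization described above (NFKC, whitespace collapse, strip) can be sketched with the standard library; this is an illustration consistent with the docstring, not necessarily the library's exact implementation:

```python
import re
import unicodedata


def norm(txt: str) -> str:
    """NFKC-normalize, collapse runs of whitespace, and strip both ends."""
    # NFKC maps compatibility characters, e.g. full-width digits,
    # to their canonical forms.
    txt = unicodedata.normalize('NFKC', txt)
    # Collapse any run of whitespace into a single space, then strip.
    return re.sub(r'\s+', ' ', txt).strip()


print(norm('  １２３  ４５６  '))  # → '123 456'
```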
[1] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations. 2017. URL: https://openreview.net/forum?id=Byj72udxe.