Wiki-Text-2 Dataset
- class lmp.dset.WikiText2Dset(*, ver: Optional[str] = None)
Bases: BaseDset
Wiki-Text-2 dataset.
Wiki-Text-2 [1] is part of the WikiText Long Term Dependency Language Modeling Dataset. See Wiki-Text for more details.
The statistics of each supported version are listed below. Tokens are separated by whitespace.
Version  Number of samples  Maximum number of tokens  Minimum number of tokens
test     60                 14299                     461
train    600                17706                     281
valid    60                 18855                     778
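As a rough illustration of how statistics like these can be computed, the sketch below counts samples and whitespace-separated tokens. The function name token_stats and the toy samples are assumptions made for this example; they are not part of lmp and the numbers are not taken from the dataset.

```python
from typing import Dict, List


def token_stats(spls: List[str]) -> Dict[str, int]:
    """Count samples and whitespace-separated tokens per sample."""
    # str.split() with no argument splits on any run of whitespace.
    n_tokens = [len(spl.split()) for spl in spls]
    return {
        'n_samples': len(spls),
        'max_tokens': max(n_tokens),
        'min_tokens': min(n_tokens),
    }


# Toy samples for illustration only.
toy_spls = ['a b c', 'd e', 'f g h i']
stats = token_stats(toy_spls)
print(stats)
```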
- Parameters
ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.
Examples
>>> from lmp.dset import WikiText2Dset
>>> dset = WikiText2Dset(ver='test')
>>> dset[0][:31]
'Robert <unk> is an English film'
- __iter__() -> Iterator[str]
Iterate through each sample in the dataset.
- Yields
str – One sample in self.spls, ordered by sample index.
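The iteration contract can be pictured with a small stand-in class. This is only a sketch: ToyDset is a hypothetical class written for this example, not the lmp implementation, and it assumes samples are held in a list attribute named spls as the description above suggests.

```python
from typing import Iterator, List


class ToyDset:
    """Hypothetical stand-in for a dataset class; not part of lmp."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __iter__(self) -> Iterator[str]:
        # Yield samples in index order: spls[0], spls[1], ...
        for spl in self.spls:
            yield spl


toy = ToyDset(['first sample', 'second sample'])
collected = list(toy)
print(collected)
```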
- classmethod download_dataset() -> None
Download the Wiki-Text-2 dataset.
Downloads the zip file from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip and extracts the raw files from it. Raw files are named wiki.ver.tokens, where ver is the version of the dataset. After the raw files are extracted, the downloaded zip file is deleted.
- Return type
None
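The download, extract, and delete sequence described above could look roughly like the following, using only the Python standard library. The helper names extract_and_remove and download_and_extract, the temporary file name download.zip, and the output directory handling are all assumptions for this sketch; the actual lmp implementation and the archive's internal layout may differ.

```python
import os
import urllib.request
import zipfile
from typing import List


def extract_and_remove(zip_path: str, out_dir: str) -> List[str]:
    """Extract every member of the zip file, then delete the zip file."""
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zf:
        names = zf.namelist()
        zf.extractall(out_dir)
    # The archive is no longer needed once the raw files are on disk.
    os.remove(zip_path)
    return names


def download_and_extract(url: str, out_dir: str) -> List[str]:
    """Download a zip file into out_dir, extract it, and remove the zip."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, 'download.zip')  # hypothetical name
    urllib.request.urlretrieve(url, zip_path)
    return extract_and_remove(zip_path, out_dir)
```

Splitting extraction from downloading keeps the file-handling step usable on a local archive without network access.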
- static norm(txt: str) -> str
Text normalization.
Text is NFKC normalized. Consecutive whitespace is collapsed, and leading and trailing whitespace is stripped.
See also
unicodedata.normalize: Python built-in Unicode normalization.
Examples
>>> from lmp.dset import BaseDset
>>> BaseDset.norm('123456789')
'123456789'
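A minimal sketch of the normalization steps described above, written independently of lmp. The name norm_sketch is hypothetical, and collapsing whitespace runs into a single space is an assumption about what "collapsed" means here.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """NFKC normalize, collapse whitespace runs, and strip both ends."""
    # NFKC folds compatibility characters, e.g. full-width digits to ASCII.
    txt = unicodedata.normalize('NFKC', txt)
    # Collapse each run of whitespace into one space, then strip both ends.
    return re.sub(r'\s+', ' ', txt).strip()


print(norm_sketch('  １２３   ４５６  '))
```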
[1] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations. 2017. URL: https://openreview.net/forum?id=Byj72udxe.