lmp.dset._wnli

WNLI dataset.

class lmp.dset._wnli.WNLIDset(*, ver: Optional[str] = None)

Bases: BaseDset

Winograd NLI dataset.

Winograd NLI is a relaxation of the Winograd Schema Challenge [1], proposed as part of the GLUE [2] benchmark. This dataset only extracts sentences from WNLI; no NLI labels are used.
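As a rough illustration of extracting only the sentences, the sketch below reads tab-separated rows and keeps both sentence columns while ignoring the label. The column names (index, sentence1, sentence2, label), the helper name read_sentences, and the two sample rows are assumptions made for illustration, not details taken from lmp.

```python
import csv
import io


def read_sentences(tsv_text: str) -> list[str]:
    """Collect both sentences of every row and ignore the label column."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter='\t',
        quoting=csv.QUOTE_NONE,  # Treat quote characters as ordinary text.
    )
    spls = []
    for row in reader:
        # Column names are an assumption about the raw file layout.
        spls.append(row['sentence1'])
        spls.append(row['sentence2'])
    return spls


# Made-up rows in the assumed layout.
raw = (
    'index\tsentence1\tsentence2\tlabel\n'
    '0\tThe cat sat on the mat.\tThe cat sat.\t1\n'
    '1\tA dog barked at him.\tHe barked.\t0\n'
)
sentences = read_sentences(raw)
print(sentences)
```

Under this layout every raw row contributes two samples, and the label column never reaches the sample list.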

The statistics of each supported version are listed below. Tokens are separated by whitespace.

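=======  =================  ========================  ========================
Version  Number of samples  Maximum number of tokens  Minimum number of tokens
=======  =================  ========================  ========================
dev      142                63                        4
test     292                60                        4
train    1270               63                        3
=======  =================  ========================  ========================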
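The statistics above count whitespace-separated tokens. A minimal sketch of that computation follows. The helper name token_stats is hypothetical, and the second sample sentence is made up for illustration.

```python
def token_stats(spls: list[str]) -> tuple[int, int, int]:
    """Return (number of samples, maximum token count, minimum token count)."""
    # str.split() with no argument splits on any run of whitespace.
    lengths = [len(spl.split()) for spl in spls]
    return len(spls), max(lengths), min(lengths)


spls = [
    'Mark was timid .',
    'The trophy does not fit in the suitcase .',  # Made-up sample.
]
stats = token_stats(spls)
print(stats)  # (2, 9, 4)
```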

Parameters

ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version is train.

Type

ClassVar[str]

dset_name

CLI name of WNLI dataset is WNLI.

Type

ClassVar[str]

spls

All samples in the dataset.

Type

list[str]

ver

Version of the dataset.

Type

str

vers

Supported versions: 'train', 'dev' and 'test'.

Type

ClassVar[list[str]]

Examples

>>> from lmp.dset import WNLIDset
>>> dset = WNLIDset(ver='test')
>>> dset[0]
Mark was timid .

classmethod download_dataset() → None

Download WNLI dataset.

Download the zip file from https://dl.fbaipublicfiles.com/glue/data/WNLI.zip and extract the raw files from it. Raw files are named wnli.ver.tsv, where ver is the version of the dataset. The downloaded zip file is deleted after the raw files are extracted.

Return type

None
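The extract-then-delete step can be sketched with the standard zipfile module. This is an illustration under stated assumptions, not lmp's implementation: the helper name extract_raw_files is hypothetical, and the archive layout (members such as WNLI/train.tsv) is assumed. The demo builds a tiny local archive instead of downloading anything.

```python
import os
import tempfile
import zipfile

VERS = ('train', 'dev', 'test')


def extract_raw_files(zip_path: str, out_dir: str) -> list[str]:
    """Write one wnli.<ver>.tsv per version, then delete the zip file."""
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            stem, ext = os.path.splitext(os.path.basename(member))
            # Skip directories and anything that is not a known version.
            if ext != '.tsv' or stem not in VERS:
                continue
            out_path = os.path.join(out_dir, f'wnli.{stem}.tsv')
            with zf.open(member) as src, open(out_path, 'wb') as dst:
                dst.write(src.read())
            written.append(out_path)
    # The archive is no longer needed once the raw files exist.
    os.remove(zip_path)
    return sorted(written)


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'WNLI.zip')
    # Stand-in archive using the assumed member layout.
    with zipfile.ZipFile(zip_path, 'w') as zf:
        for ver in VERS:
            zf.writestr(f'WNLI/{ver}.tsv', 'index\tsentence1\tsentence2\tlabel\n')
    names = [os.path.basename(p) for p in extract_raw_files(zip_path, tmp)]
    zip_deleted = not os.path.exists(zip_path)

print(names, zip_deleted)
```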

static download_file(mode: str, download_path: str, url: str) → None

Download file from url.

Parameters
  • mode (str) – Can only be 'binary' or 'text'.

  • download_path (str) – File path of the downloaded file.

  • url (str) – URL of the file to be downloaded.

Return type

None
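A function with this signature could look roughly like the sketch below, which uses urllib.request from the standard library. Whether lmp uses urllib or another HTTP client is not stated here, and the UTF-8 decoding in 'text' mode is an assumption. The demo reads from a local file:// URL so that it runs without network access.

```python
import pathlib
import tempfile
import urllib.request


def download_file(mode: str, download_path: str, url: str) -> None:
    """Fetch url and save it to download_path as bytes or as text."""
    if mode not in ('binary', 'text'):
        raise ValueError("mode can only be 'binary' or 'text'")
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    if mode == 'binary':
        with open(download_path, 'wb') as f:
            f.write(data)
    else:
        # Assumption: text payloads are UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as f:
            f.write(data.decode('utf-8'))


with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / 'src.txt'
    src.write_text('hello', encoding='utf-8')
    dst = pathlib.Path(tmp) / 'dst.txt'
    # A file:// URL stands in for a remote address in this demo.
    download_file(mode='text', download_path=str(dst), url=src.as_uri())
    copied = dst.read_text(encoding='utf-8')

print(copied)
```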

static norm(txt: str) → str

Text normalization.

Text is NFKC normalized. Consecutive whitespace is collapsed into a single space, and leading and trailing whitespace is stripped.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Python built-in Unicode normalization.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('１２３４５６７８９')
'123456789'
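The normalization described above can be approximated in a few lines. This is a sketch of the described behavior rather than the library's source. The regular expression used to collapse whitespace is an assumption.

```python
import re
import unicodedata


def norm(txt: str) -> str:
    """NFKC normalize, collapse whitespace runs, strip both ends."""
    txt = unicodedata.normalize('NFKC', txt)
    # Assumption: any run of whitespace becomes one ASCII space.
    return re.sub(r'\s+', ' ', txt).strip()


# '\uff11\uff12\uff13' are the full-width digits 1, 2 and 3.
result = norm('  \uff11\uff12\uff13   abc \n')
print(repr(result))  # '123 abc'
```

NFKC maps compatibility characters such as full-width digits to their ASCII counterparts, so visually similar inputs end up as the same string.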

[1] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. 2012. URL: https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html.

[2] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. Brussels, Belgium, November 2018. Association for Computational Linguistics. URL: https://aclanthology.org/W18-5446, doi:10.18653/v1/W18-5446.