Winograd NLI dataset

class lmp.dset.WNLIDset(*, ver: Optional[str] = None)

Bases: BaseDset

Winograd NLI dataset.

Winograd NLI (WNLI) is a relaxation of the Winograd Schema Challenge [1], proposed as part of the GLUE [2] benchmark. This dataset extracts only the sentences from WNLI; the NLI labels are not used.

The statistics of each supported version are listed below. Token counts are obtained by splitting each sample on whitespace.

Version	Number of samples	Maximum number of tokens	Minimum number of tokens
`dev`	142	63	4
`test`	292	60	4
`train`	1270	63	3
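As a minimal sketch of how such statistics could be computed, the function below counts whitespace-separated tokens per sample. The name `token_stats` and the two sample sentences are illustrative assumptions, not part of `lmp`.

```python
from typing import List, Tuple


def token_stats(samples: List[str]) -> Tuple[int, int, int]:
    """Return (number of samples, maximum token count, minimum token count)."""
    # str.split() with no argument splits on runs of whitespace.
    lengths = [len(sample.split()) for sample in samples]
    return len(samples), max(lengths), min(lengths)


# Illustrative samples only; real statistics require the downloaded dataset.
samples = ['Mark was timid .', 'The trophy does not fit in the suitcase .']
print(token_stats(samples))
```

With the dataset available, passing `dset.spls` instead of the illustrative list should give the figures for the loaded version.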

Parameters: ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version is 'train'.

Type: ClassVar[str]

dset_name

CLI name of the WNLI dataset is WNLI.

Type: ClassVar[str]

spls

All samples in the dataset.

Type: list[str]

ver

Version of the dataset.

Type: str

vers

Supported versions: 'train', 'dev' and 'test'.

Type: ClassVar[list[str]]
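The relationship between `ver`, `df_ver` and `vers` can be sketched as follows. `ToyDset` is a hypothetical stand-in written for illustration; it is not the `lmp` implementation, and rejecting unsupported versions with `ValueError` is an assumption of this sketch.

```python
from typing import ClassVar, List, Optional


class ToyDset:
    """Hypothetical stand-in for a dataset class; not lmp's implementation."""

    df_ver: ClassVar[str] = 'train'
    vers: ClassVar[List[str]] = ['dev', 'test', 'train']

    def __init__(self, *, ver: Optional[str] = None):
        # Fall back to the class-level default when no version is given.
        if ver is None:
            ver = self.__class__.df_ver
        # Assumption: unsupported versions are rejected up front.
        if ver not in self.__class__.vers:
            raise ValueError(f'unsupported version: {ver!r}')
        self.ver = ver


print(ToyDset().ver)
print(ToyDset(ver='dev').ver)
```

Keeping the default on the class lets a subclass change it by overriding `df_ver` without touching `__init__`.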

Examples

>>> from lmp.dset import WNLIDset
>>> dset = WNLIDset(ver='test')
>>> dset[0]
Mark was timid .

__getitem__(idx: int) → str

Get a sample by index.

Parameters: idx (int) – Sample index.
Returns: The sample whose index equals idx.
Return type: str

__iter__() → Iterator[str]

Iterate through each sample in the dataset.

Yields: str – One sample in self.spls, ordered by sample indices.

__len__() → int

Get dataset size.

Returns: Number of samples in the dataset.
Return type: int
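Taken together, `__getitem__`, `__iter__` and `__len__` make the dataset behave like a read-only sequence of strings. The sketch below shows one way these three methods could delegate to a list such as `spls`; `ToySamples` is a hypothetical class used only for illustration.

```python
from typing import Iterator, List


class ToySamples:
    """Hypothetical container that mirrors the three methods described above."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __getitem__(self, idx: int) -> str:
        # Index directly into the underlying list of samples.
        return self.spls[idx]

    def __iter__(self) -> Iterator[str]:
        # Yield samples in index order.
        yield from self.spls

    def __len__(self) -> int:
        return len(self.spls)


toy = ToySamples(['first sample .', 'second sample .'])
for idx, spl in enumerate(toy):
    # Iteration order matches index order.
    assert spl == toy[idx]
print(len(toy))
```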

classmethod download_dataset() → None

Download WNLI dataset.

Download the zip file from https://dl.fbaipublicfiles.com/glue/data/WNLI.zip and extract the raw files from it. Raw files are named wnli.ver.tsv, where ver is the version of the dataset. After the raw files are extracted, the downloaded zip file is deleted.

Return type: None
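The download, extract and delete sequence described above might look roughly like the sketch below. It is not the `lmp` implementation: the function names are hypothetical, and it assumes that each `.tsv` member of the archive is named after its version (for example a member ending in `train.tsv`).

```python
import os
import urllib.request
import zipfile
from typing import List


def extract_and_delete(zip_path: str, out_dir: str) -> List[str]:
    """Extract each .tsv member as wnli.<ver>.tsv, then delete the zip file."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            base = os.path.basename(member)
            if not base.endswith('.tsv'):
                continue
            # Assumption: the member's base name (minus .tsv) is the version.
            ver = base[:-len('.tsv')]
            out_path = os.path.join(out_dir, f'wnli.{ver}.tsv')
            with zf.open(member) as src, open(out_path, 'wb') as dst:
                dst.write(src.read())
            written.append(out_path)
    # The archive is no longer needed once the raw files are extracted.
    os.remove(zip_path)
    return written


def download_wnli(out_dir: str) -> List[str]:
    """Download the WNLI archive into out_dir and unpack it (needs network)."""
    url = 'https://dl.fbaipublicfiles.com/glue/data/WNLI.zip'
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, 'WNLI.zip')
    urllib.request.urlretrieve(url, zip_path)
    return extract_and_delete(zip_path, out_dir)
```

Splitting extraction from downloading keeps the file-handling part testable without network access.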

static download_file(mode: str, download_path: str, url: str) → None

Download file from url.

Parameters:
mode (str) – Can only be 'binary' or 'text'.
download_path (str) – File path of the downloaded file.
url (str) – URL of the file to be downloaded.

Return type: None
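A helper with this signature could be sketched with the standard library alone, as below. The name `fetch_file` is hypothetical, and both the UTF-8 decoding in 'text' mode and the `ValueError` for other modes are assumptions of this sketch rather than documented behavior.

```python
import urllib.request


def fetch_file(mode: str, download_path: str, url: str) -> None:
    """Download url to download_path as raw bytes or as decoded text."""
    if mode not in ('binary', 'text'):
        raise ValueError("mode must be 'binary' or 'text'")

    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    if mode == 'binary':
        # Write the payload unchanged.
        with open(download_path, 'wb') as out_file:
            out_file.write(data)
    else:
        # Assumption: text payloads are UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as out_file:
            out_file.write(data.decode('utf-8'))
```

'binary' mode suits archives such as zip files, while 'text' mode suits plain-text corpora.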

static norm(txt: str) → str

Text normalization.

Text is NFKC normalized. Consecutive whitespace is collapsed into a single space, and leading and trailing whitespace is stripped.

Parameters: txt (str) – Text to be normalized.
Returns: Normalized text.
Return type: str

See also

unicodedata.normalize: Unicode normalization in the Python standard library.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('１２３４５６７８９')
'123456789'
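The normalization steps described above can be approximated with the standard library, as in the sketch below. `norm_sketch` is a hypothetical name, and applying NFKC before collapsing whitespace is an assumption about ordering, not a statement about the actual implementation.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """NFKC-normalize, collapse whitespace runs, and strip both ends."""
    # NFKC maps compatibility characters (e.g. full-width digits) to
    # their canonical equivalents.
    txt = unicodedata.normalize('NFKC', txt)
    # Replace every run of whitespace with a single space, then strip.
    return re.sub(r'\s+', ' ', txt).strip()


print(norm_sketch('  Mark   was\ttimid .  '))
```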

[1] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. 2012. URL: https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html.
[2] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. Brussels, Belgium, November 2018. Association for Computational Linguistics. URL: https://aclanthology.org/W18-5446, doi:10.18653/v1/W18-5446.