Demo Dataset

class lmp.dset.DemoDset(*, ver: Optional[str] = None)

Bases: BaseDset

Demo dataset.

This dataset consists of 2-digit addition sentences. Every sample has the following format:

If you add \(a\) to \(b\) you get \(a + b\) .

where \(a, b\) are integers from \(0\) to \(99\) (inclusive).

Here we describe the dataset in detail. Let \(N = \set{0, 1, \dots, 99}\) be the set of non-negative integers which are less than \(100\). Let \(a, b \in N\).

Version  Design Philosophy                                                                Constraint
train    Training set.                                                                    \(a < b\)
valid    Check whether the model learns the commutative law of 2-digit integer addition.  \(a > b\)
test     Check whether the model learns to generalize 2-digit addition.                   \(a = b\)
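
The three constraints above split all 10,000 ordered pairs \((a, b)\) into disjoint versions. The sketch below reproduces that split in plain Python. Note that `build_demo_samples` is a hypothetical helper and not part of lmp, and the enumeration order (with \(a\) in the outer loop) is an assumption.

```python
def build_demo_samples(ver: str) -> list[str]:
    """Enumerate demo-style samples for one version (hypothetical helper)."""
    # Constraint on (a, b) for each version, as listed above.
    constraints = {
        'train': lambda a, b: a < b,
        'valid': lambda a, b: a > b,
        'test': lambda a, b: a == b,
    }
    if ver not in constraints:
        raise ValueError(f'unsupported version: {ver!r}')
    keep = constraints[ver]
    # Assumed order: a in the outer loop, b in the inner loop.
    return [
        f'If you add {a} to {b} you get {a + b} .'
        for a in range(100)
        for b in range(100)
        if keep(a, b)
    ]


# Count the samples that fall into each version.
sizes = {ver: len(build_demo_samples(ver)) for ver in ('train', 'valid', 'test')}
print(sizes)
```

Under these assumptions the train and valid versions each hold 4,950 samples and the test version holds 100, so every ordered pair appears in exactly one version.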

Parameters

ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version is 'train'.

Type

ClassVar[str]

dset_name

CLI name of the demo dataset is demo.

Type

ClassVar[str]

spls

All samples in the dataset.

Type

list[str]

ver

Version of the dataset.

Type

str

vers

Supported versions: 'train', 'test', and 'valid'.

Type

ClassVar[list[str]]

See also

lmp.dset

All available datasets.

Examples

>>> from lmp.dset import DemoDset
>>> dset = DemoDset(ver='train')
>>> dset[0]
'If you add 0 to 1 you get 1 .'
__getitem__(idx: int) → str

Get a sample by its index.

Parameters

idx (int) – Sample index.

Returns

The sample whose index equals idx.

Return type

str

__iter__() → Iterator[str]

Iterate through each sample in the dataset.

Yields

str – One sample in self.spls, ordered by sample indices.

__len__() → int

Get dataset size.

Returns

Number of samples in the dataset.

Return type

int
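
The three methods above make the dataset behave like a read-only sequence of strings. As a minimal sketch of that protocol, the hypothetical stand-in class `TinyDset` below (not the lmp implementation) stores its samples in a `spls` list and delegates to it.

```python
from typing import Iterator


class TinyDset:
    """Hypothetical stand-in that mimics the documented container protocol."""

    def __init__(self, spls: list[str]) -> None:
        # All samples, kept in index order.
        self.spls = spls

    def __getitem__(self, idx: int) -> str:
        # Return the sample stored at position idx.
        return self.spls[idx]

    def __iter__(self) -> Iterator[str]:
        # Yield samples one by one, ordered by sample index.
        yield from self.spls

    def __len__(self) -> int:
        # Dataset size is the number of stored samples.
        return len(self.spls)


dset = TinyDset([
    'If you add 0 to 1 you get 1 .',
    'If you add 0 to 2 you get 2 .',
])
first = dset[0]
size = len(dset)
# Iteration visits the same samples, in the same order, as indexing does.
in_order = list(dset) == [dset[i] for i in range(len(dset))]
```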

static download_file(mode: str, download_path: str, url: str) → None

Download file from url.

Parameters
  • mode (str) – Can only be 'binary' or 'text'.

  • download_path (str) – File path of the downloaded file.

  • url (str) – URL of the file to be downloaded.

Return type

None
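
A function with this signature could be sketched as follows, using only the standard library. The name `download_file_sketch`, the use of `urllib.request`, the UTF-8 decoding in text mode, and the `ValueError` on an unknown mode are all assumptions for illustration. The library's own implementation may use a different HTTP client and different error handling.

```python
import urllib.request


def download_file_sketch(mode: str, download_path: str, url: str) -> None:
    """Fetch url and write it to download_path (hypothetical stand-in)."""
    # Assumption: an unsupported mode is rejected before any network access.
    if mode not in ('binary', 'text'):
        raise ValueError("mode must be 'binary' or 'text'")

    # Read the whole response body into memory.
    with urllib.request.urlopen(url) as response:
        payload = response.read()

    if mode == 'binary':
        # Binary mode: write the raw bytes unchanged.
        with open(download_path, 'wb') as out_file:
            out_file.write(payload)
    else:
        # Text mode. Assumption: the payload is UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as out_file:
            out_file.write(payload.decode('utf-8'))
```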

static norm(txt: str) → str

Text normalization.

Text is NFKC normalized. Consecutive whitespace is collapsed, and leading and trailing whitespace is stripped.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Python's built-in Unicode normalization.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('１２３４５６７８９')
'123456789'
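
The normalization steps described for this method can be approximated with the standard library. The function `norm_sketch` below is a hypothetical re-implementation for illustration. Collapsing each whitespace run to a single space is an assumption about what "collapsed" means here, and the library's actual code may differ.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """Approximate the documented normalization (hypothetical helper)."""
    # Step 1: NFKC maps compatibility characters, such as full-width
    # digits, to their canonical equivalents.
    txt = unicodedata.normalize('NFKC', txt)
    # Step 2: collapse every run of whitespace into one space (assumed).
    txt = re.sub(r'\s+', ' ', txt)
    # Step 3: strip whitespace from both ends.
    return txt.strip()


result = norm_sketch('  １２３\t\t４５  ')
```

Running NFKC first means full-width and other compatibility characters are already in canonical form before whitespace is handled.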