Dataset base class

class lmp.dset.BaseDset(*, ver: Optional[str] = None)

Bases: Dataset

Dataset base class.

Most datasets must be downloaded from the web; only some can be generated locally. A dataset is downloaded or generated automatically if its files are not on your local machine. Nothing is downloaded or generated if the dataset files already exist locally.
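The download-or-generate-once behaviour described above can be sketched as follows. This is a minimal illustration, not the library's own code: `ensure_file`, `make`, and `write_demo` are hypothetical names, and the file layout is an assumption.

```python
import os
import tempfile


def ensure_file(path: str, make) -> bool:
    """Create `path` with `make(path)` only if it does not exist yet.

    Returns True if the file was created, False if it was already present.
    `ensure_file` and `make` are hypothetical names, not part of lmp.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    make(path)
    return True


def write_demo(path: str) -> None:
    # Stand-in for a download or a local generation step.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('sample 1\nsample 2\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'demo', 'train.txt')
    first = ensure_file(target, write_demo)   # file is created
    second = ensure_file(target, write_demo)  # existing file is reused
```

The second call does no work because the file is already on disk, which mirrors the documented guarantee that existing dataset files are left alone.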

Parameters

ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.
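As a rough sketch of how the `ver` fallback could work, the stand-in class below resolves `None` to the class-level default. `DemoDset` and its version names are hypothetical, and rejecting unknown versions is an assumption of this sketch rather than documented behaviour.

```python
from typing import ClassVar, List, Optional


class DemoDset:
    """Hypothetical stand-in that mimics the documented `ver` handling."""

    df_ver: ClassVar[str] = 'train'
    vers: ClassVar[List[str]] = ['train', 'valid', 'test']

    def __init__(self, *, ver: Optional[str] = None):
        # Fall back to the class-level default when no version is given.
        if ver is None:
            ver = self.__class__.df_ver
        # Rejecting unknown versions is an assumption of this sketch.
        if ver not in self.__class__.vers:
            raise ValueError(f'unsupported version: {ver!r}')
        self.ver = ver


default_ver = DemoDset().ver
valid_ver = DemoDset(ver='valid').ver
```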

df_ver

Default version of the dataset.

Type

ClassVar[str]

dset_name

CLI name of the dataset. Only used to parse CLI arguments.

Type

ClassVar[str]

spls

All samples in the dataset.

Type

list[str]

ver

Version of the dataset.

Type

str

vers

List of supported dataset versions.

Type

ClassVar[list[str]]

See also

lmp.dset

All available datasets.

__getitem__(idx: int) -> str

Get a sample by its index.

Parameters

idx (int) – Sample index.

Returns

The sample whose index equals idx.

Return type

str

__iter__() -> Iterator[str]

Iterate through each sample in the dataset.

Yields

str – One sample in self.spls, ordered by sample indices.

__len__() -> int

Get dataset size.

Returns

Number of samples in the dataset.

Return type

int
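The three methods above together give a dataset list-like behaviour. A minimal sketch, using a hypothetical `ListDset` stand-in rather than the real `lmp.dset.BaseDset`, might look like this:

```python
from typing import Iterator, List


class ListDset:
    """Hypothetical stand-in, not the real lmp.dset.BaseDset."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __getitem__(self, idx: int) -> str:
        return self.spls[idx]

    def __iter__(self) -> Iterator[str]:
        # Yield samples in index order.
        for spl in self.spls:
            yield spl

    def __len__(self) -> int:
        return len(self.spls)


dset = ListDset(['hello world', 'foo bar'])
size = len(dset)
first = dset[0]
all_spls = list(dset)
```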

static download_file(mode: str, download_path: str, url: str) -> None

Download a file from url.

Parameters
  • mode (str) – Can only be 'binary' or 'text'.

  • download_path (str) – File path of the downloaded file.

  • url (str) – URL of the file to be downloaded.

Return type

None
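A function with the same signature could be sketched with the standard library as shown below. This is only an illustration under stated assumptions: the use of `urllib.request`, the UTF-8 decoding in 'text' mode, and the `ValueError` on an unknown mode are choices made for this sketch, and the library may use a different HTTP client or error handling.

```python
import os
import pathlib
import tempfile
import urllib.request


def download_file(mode: str, download_path: str, url: str) -> None:
    """Sketch only: fetch `url` and write it to `download_path`."""
    if mode not in ('binary', 'text'):
        raise ValueError("mode must be 'binary' or 'text'")
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    if mode == 'binary':
        with open(download_path, 'wb') as f:
            f.write(data)
    else:
        # Decoding as UTF-8 is an assumption of this sketch.
        with open(download_path, 'w', encoding='utf-8') as f:
            f.write(data.decode('utf-8'))


# Demonstrate with a local file:// URL so no network access is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = pathlib.Path(tmp_dir) / 'src.txt'
    src.write_text('hello', encoding='utf-8')
    dst = os.path.join(tmp_dir, 'dst.txt')
    download_file('text', dst, src.as_uri())
    with open(dst, encoding='utf-8') as f:
        copied = f.read()
```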

static norm(txt: str) -> str

Text normalization.

Text is NFKC normalized. Consecutive whitespace characters are collapsed into one, and whitespace is stripped from both ends.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Python built-in Unicode normalization.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('１２３４５６７８９')
'123456789'
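The normalization steps described for `norm` can be approximated with the standard library. The function below is a sketch of the documented behaviour, not the library's own code; collapsing whitespace runs into a single space is an assumption about how "collapsed" is meant.

```python
import re
import unicodedata


def norm(txt: str) -> str:
    """Sketch of the documented behaviour, not the library's own code."""
    # NFKC folds compatibility characters, e.g. full-width digits.
    txt = unicodedata.normalize('NFKC', txt)
    # Collapse whitespace runs into one space, then strip both ends.
    return re.sub(r'\s+', ' ', txt).strip()


full_width = norm('１２３')
spaced = norm('  hello \t\n world  ')
```

Here `full_width` becomes the ASCII string '123' and `spaced` becomes 'hello world'.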