Dataset base class

class lmp.dset.BaseDset(*, ver: Optional[str] = None)

Bases: Dataset

Dataset base class.

Most datasets must be downloaded from the web; only some can be generated locally. A dataset is downloaded or generated automatically if its files are not already on your local machine. If the dataset files already exist locally, nothing is downloaded or generated.

Parameters: ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version of the dataset.

Type: ClassVar[str]

dset_name

CLI name of the dataset. Only used to parse CLI arguments.

Type: ClassVar[str]

spls

All samples in the dataset.

Type: list[str]

ver

Version of the dataset.

Type: str

vers

List of supported dataset versions.

Type: ClassVar[list[str]]
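
The interaction between the ver parameter and the df_ver / vers class attributes can be sketched as follows. This is a minimal, hypothetical stand-in (ToyDset is not part of lmp), and the version names and the ValueError raised for an unsupported version are assumptions; the documentation above only states that None falls back to self.__class__.df_ver.

```python
from typing import ClassVar, List, Optional


class ToyDset:
    """Hypothetical stand-in that mimics the documented version handling."""

    # Class-level metadata, mirroring the attributes described above.
    df_ver: ClassVar[str] = 'train'
    dset_name: ClassVar[str] = 'toy'
    vers: ClassVar[List[str]] = ['train', 'valid', 'test']

    def __init__(self, *, ver: Optional[str] = None):
        # Fall back to the class default when no version is given.
        if ver is None:
            ver = self.__class__.df_ver
        # Assumption: unsupported versions are rejected with ValueError.
        if ver not in self.__class__.vers:
            raise ValueError(f'unsupported version: {ver!r}')
        self.ver = ver


print(ToyDset().ver)            # train
print(ToyDset(ver='test').ver)  # test
```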

See also

lmp.dset: All available datasets.

__getitem__(idx: int) → str

Get a sample by index.

Parameters: idx (int) – Sample index.
Returns: The sample whose index equals idx.
Return type: str

__iter__() → Iterator[str]

Iterate through each sample in the dataset.

Yields: str – One sample in self.spls, ordered by sample indices.

__len__() → int

Get dataset size.

Returns: Number of samples in the dataset.
Return type: int
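
Taken together, __getitem__, __iter__ and __len__ make a dataset behave like a read-only sequence of strings backed by spls. The sketch below illustrates that contract with a hypothetical in-memory class (InMemoryDset and its sample texts are made up for illustration and are not part of lmp).

```python
from typing import Iterator, List


class InMemoryDset:
    """Hypothetical class exposing the same three methods over self.spls."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __getitem__(self, idx: int) -> str:
        # Return the sample stored at position idx.
        return self.spls[idx]

    def __iter__(self) -> Iterator[str]:
        # Yield samples in index order.
        for spl in self.spls:
            yield spl

    def __len__(self) -> int:
        # Dataset size is the number of stored samples.
        return len(self.spls)


dset = InMemoryDset(['hello world', 'foo bar', 'baz'])
print(len(dset))   # 3
print(dset[1])     # foo bar
for spl in dset:
    print(spl)
```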

static download_file(mode: str, download_path: str, url: str) → None

Download a file from url.

Parameters

mode (str) – Download mode. Must be either 'binary' or 'text'.
download_path (str) – Path where the downloaded file is saved.
url (str) – URL of the file to be downloaded.

Return type

None
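
A helper with this signature could look roughly like the sketch below. It is not lmp's actual implementation: the use of urllib.request, the UTF-8 decoding in 'text' mode, and the ValueError for an invalid mode are all assumptions made for illustration.

```python
import urllib.request


def download_file(mode: str, download_path: str, url: str) -> None:
    """Sketch of a downloader with the documented signature."""
    # Assumption: any mode other than 'binary' or 'text' is an error.
    if mode not in ('binary', 'text'):
        raise ValueError(f"mode must be 'binary' or 'text', got {mode!r}")

    # Fetch the whole response body into memory.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    if mode == 'binary':
        # Write the raw bytes unchanged.
        with open(download_path, 'wb') as out_file:
            out_file.write(data)
    else:
        # Assumption: text content is UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as out_file:
            out_file.write(data.decode('utf-8'))
```

Reading the full body into memory keeps the sketch short; for large dataset files a streamed, chunked copy would be the more careful choice.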

static norm(txt: str) → str

Text normalization.

Text is NFKC normalized. Whitespace is collapsed and stripped from both ends.

Parameters: txt (str) – Text to be normalized.
Returns: Normalized text.
Return type: str
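
The normalization described above can be approximated with the standard library. The sketch below assumes that "collapsed" means each run of whitespace is replaced by a single space; it is an illustration, not necessarily the exact code used by lmp.

```python
import re
import unicodedata


def norm(txt: str) -> str:
    """Sketch of NFKC normalization plus whitespace cleanup."""
    # NFKC maps compatibility characters (e.g. full-width letters)
    # to their canonical equivalents.
    txt = unicodedata.normalize('NFKC', txt)
    # Assumption: each whitespace run collapses to one space,
    # then leading and trailing whitespace is removed.
    return re.sub(r'\s+', ' ', txt).strip()


print(norm('  Ｈｅｌｌｏ \t\n ｗｏｒｌｄ  '))  # Hello world
```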