lmp.dset._ch_poem

Chinese poetry dataset.

class lmp.dset._ch_poem.ChPoemDset(*, ver: Optional[str] = None)

Bases: BaseDset

Chinese poems grouped by dynasty.

See https://github.com/Werneror/Poetry for details on the dataset. See https://github.com/ProFatXuanAll/demo-dataset for details on dataset preprocessing.

Here we list some dataset statistics.

dynasty           number of poems   number of authors
宋                287114            9446
明                236957            4439
清                90089             8872
唐                49195             2736
元                37375             1209
近現代            28419             790
當代              28219             177
明末清初          17700             176
元末明初          15736             79
清末民國初        15367             99
清末近現代初      12464             48
宋末元初          12058             41
南北朝            4586              434
近現代末當代初    3426              23
魏晉              3020              251
金末元初          3019              17
金                2741              253
民國末當代初      1948              9
隋                1170              84
唐末宋初          1118              44
先秦              570               8
隋末唐初          472               40
漢                363               83
宋末金初          234               9
遼                22                7
秦                2                 2
魏晉末南北朝初    1                 1
total             853385            29377

Parameters

ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver

Default version is '唐'.

Type

ClassVar[str]

dset_name

CLI name of the Chinese poem dataset is chinese-poem.

Type

ClassVar[str]

spls

All samples in the dataset.

Type

list[str]

ver

Version of the dataset.

Type

str

vers

All available versions of the dataset. Each version is named after the dynasty in which its poems appeared: 元, 元末明初, 先秦, 南北朝, 唐, 唐末宋初, 宋, 宋末元初, 宋末金初, 明, 明末清初, 民國末當代初, 清, 清末民國初, 清末近現代初, 漢, 當代, 秦, 近現代, 近現代末當代初, 遼, 金, 金末元初, 隋, 隋末唐初, 魏晉, 魏晉末南北朝初.

Type

ClassVar[list[str]]
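
The interplay between ver, df_ver, and vers can be illustrated with a minimal sketch. The helper resolve_ver below is hypothetical and not part of lmp, and VERS is only an illustrative subset of the documented versions; the sketch assumes that None falls back to the default and that unknown names are rejected.

```python
from typing import Optional, Sequence

# Illustrative subset of the documented version names, not the full list.
VERS = ('先秦', '唐', '魏晉')
# Default version, as documented for df_ver.
DF_VER = '唐'


def resolve_ver(
    ver: Optional[str],
    df_ver: str = DF_VER,
    vers: Sequence[str] = VERS,
) -> str:
    """Hypothetical helper: fall back to the default, then validate."""
    if ver is None:
        # No version given, so use the class-level default.
        return df_ver
    if ver not in vers:
        raise ValueError(f'Unknown version {ver!r}; expected one of {list(vers)}.')
    return ver
```

Whether the real constructor raises ValueError or some other exception on an unknown version is not stated here, so treat the error type as an assumption.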

See also

lmp.dset

All available datasets.

Examples

>>> from lmp.dset import ChPoemDset
>>> dset = ChPoemDset(ver='唐')
>>> dset[0][:10]
'風淅淅。夜雨連雲黑。'

classmethod download_dataset(ver: str) → None

Download Chinese poem dataset.

Downloads a zip file from GitHub and extracts the raw file from it. The raw file is named ver.csv, where ver is the version of the dataset. The zip file is deleted after the raw file is extracted.

Parameters

ver (str) – Version of the dataset.

Return type

None
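
The extract-then-delete step described above could look roughly like the following sketch. The helper extract_raw_file is hypothetical, and it assumes the archive stores the raw file as a top-level member named ver.csv; the real archive layout may differ.

```python
import os
import zipfile


def extract_raw_file(zip_path: str, ver: str, out_dir: str) -> str:
    """Hypothetical helper: extract ``<ver>.csv`` from a zip, then delete the zip."""
    # Assumption: the raw file is a top-level member of the archive.
    member = f'{ver}.csv'
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extract(member, path=out_dir)
    # The archive is no longer needed once the raw file is on disk.
    os.remove(zip_path)
    return os.path.join(out_dir, member)
```

Removing the archive only after a successful extraction means a failed extraction leaves the zip file in place for inspection.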

static download_file(mode: str, download_path: str, url: str) → None

Download file from url.

Parameters
  • mode (str) – Download mode. Must be either 'binary' or 'text'.

  • download_path (str) – File path of the downloaded file.

  • url (str) – URL of the file to be downloaded.

Return type

None
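
As a rough illustration of the two modes, the sketch below fetches a URL with the standard library and writes it either as raw bytes or as decoded text. The function name fetch_to_path is hypothetical, and the UTF-8 decoding in 'text' mode is an assumption; this is not the library's actual implementation.

```python
import urllib.request


def fetch_to_path(mode: str, download_path: str, url: str) -> None:
    """Hypothetical two-mode downloader built on urllib."""
    if mode not in ('binary', 'text'):
        raise ValueError("mode must be 'binary' or 'text'")
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    if mode == 'binary':
        # Write the payload unchanged.
        with open(download_path, 'wb') as f:
            f.write(data)
    else:
        # Assumption: text payloads are UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as f:
            f.write(data.decode('utf-8'))
```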

static norm(txt: str) → str

Text normalization.

Text is NFKC normalized. Consecutive whitespace characters are collapsed, and leading and trailing whitespace is stripped.

Parameters

txt (str) – Text to be normalized.

Returns

Normalized text.

Return type

str

See also

unicodedata.normalize

Unicode normalization in the Python standard library.

Examples

>>> from lmp.dset import BaseDset
>>> BaseDset.norm('123456789')
'123456789'
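
The effect of NFKC normalization is easier to see on full-width input. The sketch below reproduces the described behavior with the standard library only; norm_sketch is a hypothetical stand-in, and collapsing whitespace runs to a single ASCII space is an assumption about what "collapsed" means here.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """Hypothetical stand-in: NFKC normalize, collapse whitespace, strip."""
    # NFKC maps compatibility characters (e.g. full-width digits) to canonical forms.
    txt = unicodedata.normalize('NFKC', txt)
    # Assumption: each run of whitespace becomes one ASCII space.
    txt = re.sub(r'\s+', ' ', txt)
    return txt.strip()


print(norm_sketch('１２３４５６７８９'))  # prints 123456789
print(norm_sketch('  a \n\t b  '))  # prints a b
```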