`lmp.dset._ch_poem`#

Chinese poetry dataset.

class lmp.dset._ch_poem.ChPoemDset(*, ver: Optional[str] = None)[source]#

Bases: BaseDset

Poems from ancient Chinese dynasties.

See https://github.com/Werneror/Poetry for details on the dataset. See https://github.com/ProFatXuanAll/demo-dataset for details on dataset preprocessing.

The table below lists the number of poems and authors for each dynasty.

dynasty	number of poems	number of authors
`宋`	287114	9446
`明`	236957	4439
`清`	90089	8872
`唐`	49195	2736
`元`	37375	1209
`近現代`	28419	790
`當代`	28219	177
`明末清初`	17700	176
`元末明初`	15736	79
`清末民國初`	15367	99
`清末近現代初`	12464	48
`宋末元初`	12058	41
`南北朝`	4586	434
`近現代末當代初`	3426	23
`魏晉`	3020	251
`金末元初`	3019	17
`金`	2741	253
`民國末當代初`	1948	9
`隋`	1170	84
`唐末宋初`	1118	44
`先秦`	570	8
`隋末唐初`	472	40
`漢`	363	83
`宋末金初`	234	9
`遼`	22	7
`秦`	2	2
`魏晉末南北朝初`	1	1
total	853385	29377

Parameters: ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.

df_ver#

Default version is '唐'.

Type: ClassVar[str]

dset_name#

CLI name of the Chinese poem dataset is chinese-poem.

Type: ClassVar[str]

spls#

All samples in the dataset.

Type: list[str]

ver#

Version of the dataset.

Type: str

vers#

All available versions of the dataset. Each version is named after the dynasty in which its poems were written. Available versions are 元, 元末明初, 先秦, 南北朝, 唐, 唐末宋初, 宋, 宋末元初, 宋末金初, 明, 明末清初, 民國末當代初, 清, 清末民國初, 清末近現代初, 漢, 當代, 秦, 近現代, 近現代末當代初, 遼, 金, 金末元初, 隋, 隋末唐初, 魏晉, 魏晉末南北朝初.

Type: ClassVar[list[str]]
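The relationship between ver, df_ver and vers can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the real ChPoemDset; in particular, rejecting a version that is not listed in vers is an assumption, and only a small subset of the documented versions is included.

```python
from typing import ClassVar, List, Optional


class VersionedDsetSketch:
    """Hypothetical stand-in for a versioned dataset class."""

    # Default version, mirroring the documented df_ver value.
    df_ver: ClassVar[str] = '唐'
    # A small subset of the documented versions, for illustration only.
    vers: ClassVar[List[str]] = ['唐', '宋', '元']

    def __init__(self, *, ver: Optional[str] = None):
        # Fall back to the class-level default when no version is given.
        if ver is None:
            ver = self.__class__.df_ver
        # Assumption: unknown versions are rejected; the real class may differ.
        if ver not in self.__class__.vers:
            raise ValueError(f'unknown version: {ver!r}')
        self.ver = ver
```

With this sketch, VersionedDsetSketch().ver is '唐' and VersionedDsetSketch(ver='宋').ver is '宋'.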

See also

lmp.dset: All available datasets.

Examples

>>> from lmp.dset import ChPoemDset
>>> dset = ChPoemDset(ver='唐')
>>> dset[0][:10]
'風淅淅。夜雨連雲黑。'

classmethod download_dataset(ver: str) → None[source]#

Download the Chinese poem dataset.

Downloads a zip file from GitHub and extracts the raw file from it. The raw file is named ver.csv, where ver is the version of the dataset. The zip file is deleted after the raw file is extracted.

Parameters: ver (str) – Version of the dataset.
Return type: None
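The extract-then-delete step described above might look roughly like the following sketch. The helper name extract_and_cleanup and its parameters are hypothetical, and it assumes the archive stores the raw file as `<ver>.csv` at its root; the actual archive layout and implementation may differ.

```python
import os
import zipfile


def extract_and_cleanup(zip_path: str, ver: str, out_dir: str) -> str:
    """Extract `<ver>.csv` from a zip archive, then delete the archive.

    Hypothetical helper; assumes the member sits at the archive root.
    """
    member = f'{ver}.csv'
    with zipfile.ZipFile(zip_path, 'r') as zf:
        # Extract only the raw file for the requested version.
        zf.extract(member, path=out_dir)
    # The zip file is no longer needed once the raw file is extracted.
    os.remove(zip_path)
    return os.path.join(out_dir, member)
```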

static download_file(mode: str, download_path: str, url: str) → None#

Download a file from the given URL.

Parameters

mode (str) – Download mode. Must be either 'binary' or 'text'.
download_path (str) – File path of the downloaded file.
url (str) – URL of the file to be downloaded.

Return type

None
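A minimal sketch of how the two modes could differ, using only the standard library. The function name download_file_sketch is hypothetical, and both the use of urllib and the UTF-8 decoding in text mode are assumptions; the library's own implementation may use a different HTTP client and encoding handling.

```python
import urllib.request


def download_file_sketch(mode: str, download_path: str, url: str) -> None:
    """Hypothetical sketch: fetch `url` and save it to `download_path`."""
    if mode not in ('binary', 'text'):
        raise ValueError("mode must be 'binary' or 'text'")

    # Read the whole response body into memory as bytes.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    if mode == 'binary':
        # Write the bytes unchanged, e.g. for zip archives.
        with open(download_path, 'wb') as f:
            f.write(data)
    else:
        # Assumption: text content is UTF-8 encoded.
        with open(download_path, 'w', encoding='utf-8') as f:
            f.write(data.decode('utf-8'))
```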

static norm(txt: str) → str#

Text normalization.

Text is NFKC normalized. Consecutive whitespace is collapsed, and leading and trailing whitespace is stripped.

Parameters: txt (str) – Text to be normalized.
Returns: Normalized text.
Return type: str
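The normalization described above can be approximated with the standard library as follows. The function name norm_sketch is hypothetical, and collapsing each whitespace run into a single space is an assumption about what "collapsed" means here.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """Hypothetical sketch of NFKC normalization plus whitespace cleanup."""
    # NFKC maps compatibility characters (e.g. full-width letters and
    # the ideographic space) to their canonical equivalents.
    txt = unicodedata.normalize('NFKC', txt)
    # Assumption: each run of whitespace collapses into one space.
    txt = re.sub(r'\s+', ' ', txt)
    # Remove leading and trailing whitespace.
    return txt.strip()
```

For example, norm_sketch('  ＡＢＣ　 １２３\n') returns 'ABC 123': the full-width characters become ASCII, the ideographic space becomes a regular space, and the surrounding whitespace is removed.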
