lmp.dset._ch_poem
Chinese poetry dataset.
- class lmp.dset._ch_poem.ChPoemDset(*, ver: Optional[str] = None)
Bases: BaseDset
Poems of ancient Chinese dynasties.
See https://github.com/Werneror/Poetry for details on the dataset. See https://github.com/ProFatXuanAll/demo-dataset for details on dataset preprocessing.
Here we list some dataset statistics.
=============== =============== =================
dynasty         number of poems number of authors
=============== =============== =================
宋              287114          9446
明              236957          4439
清              90089           8872
唐              49195           2736
元              37375           1209
近現代          28419           790
當代            28219           177
明末清初        17700           176
元末明初        15736           79
清末民國初      15367           99
清末近現代初    12464           48
宋末元初        12058           41
南北朝          4586            434
近現代末當代初  3426            23
魏晉            3020            251
金末元初        3019            17
金              2741            253
民國末當代初    1948            9
隋              1170            84
唐末宋初        1118            44
先秦            570             8
隋末唐初        472             40
漢              363             83
宋末金初        234             9
遼              22              7
秦              2               2
魏晉末南北朝初  1               1
total           853385          29377
=============== =============== =================
- Parameters
ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.
- vers
All available versions of the dataset. Each version is named after the dynasty in which its poems appeared. The available versions are 元, 元末明初, 先秦, 南北朝, 唐, 唐末宋初, 宋, 宋末元初, 宋末金初, 明, 明末清初, 民國末當代初, 清, 清末民國初, 清末近現代初, 漢, 當代, 秦, 近現代, 近現代末當代初, 遼, 金, 金末元初, 隋, 隋末唐初, 魏晉, and 魏晉末南北朝初.
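The fallback from ver=None to the class default, together with a check against vers, might look roughly like the sketch below. It is an illustration only, not the library's implementation: DemoDset and resolve_ver are hypothetical names, the vers list is truncated to three entries, and using 唐 as df_ver is an assumption made for the example.

```python
from typing import ClassVar, List, Optional


class DemoDset:
    """Hypothetical stand-in for a dataset class; not part of lmp."""

    # Assumed default version, chosen only for this illustration.
    df_ver: ClassVar[str] = '唐'
    # Truncated list of versions; the real class lists every dynasty.
    vers: ClassVar[List[str]] = ['元', '唐', '宋']

    def __init__(self, *, ver: Optional[str] = None):
        self.ver = self.resolve_ver(ver)

    @classmethod
    def resolve_ver(cls, ver: Optional[str]) -> str:
        # Fall back to the class default when no version is given.
        if ver is None:
            return cls.df_ver
        # Reject versions that the class does not list.
        if ver not in cls.vers:
            raise ValueError(f'version {ver!r} is not in {cls.vers}')
        return ver
```

Resolving the version in a classmethod keeps the check usable before an instance exists, which is convenient for other classmethods that also take a ver argument.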
See also
- lmp.dset
All available datasets.
Examples
>>> from lmp.dset import ChPoemDset
>>> dset = ChPoemDset(ver='唐')
>>> dset[0][:10]
'風淅淅。夜雨連雲黑。'
- classmethod download_dataset(ver: str) -> None
Download Chinese poem dataset.
Download the zip file from GitHub and extract the raw file from it. The raw file is named ver.csv, where ver is the version of the dataset. The zip file is deleted after the raw file is extracted.
- Parameters
ver (str) – Version of the dataset.
- Return type
None
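The download, extract, and delete sequence described above can be sketched with the standard library alone. This is a sketch under stated assumptions, not the library's code: the function names fetch_zip and extract_raw_file are hypothetical, the archive is assumed to hold a member named ver.csv at its root, and the caller supplies the URL because the actual download location is not given here.

```python
import os
import urllib.request
import zipfile


def fetch_zip(url: str, zip_path: str) -> None:
    """Download the archive at ``url`` to ``zip_path`` (URL supplied by caller)."""
    urllib.request.urlretrieve(url, zip_path)


def extract_raw_file(zip_path: str, ver: str, out_dir: str) -> str:
    """Extract ``<ver>.csv`` from the zip file, then delete the zip file."""
    member = f'{ver}.csv'  # Assumed member name inside the archive.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract(member, path=out_dir)
    # The zip file is no longer needed once the raw file is extracted.
    os.remove(zip_path)
    return os.path.join(out_dir, member)
```

Splitting the network step from the extraction step lets the extraction be exercised against a locally built archive without touching the network.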
- static norm(txt: str) -> str
Text normalization.
Text is NFKC normalized. Whitespace is collapsed and stripped from both ends.
See also
unicodedata.normalize
Python built-in unicode normalization.
Examples
>>> from lmp.dset import BaseDset
>>> BaseDset.norm('123456789')
'123456789'
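For readers who want to see the steps spelled out, the normalization described above can be approximated as follows. This is a minimal sketch, not the method's actual body: norm_sketch is a hypothetical name, and collapsing whitespace with a regular expression is an assumption about how the collapsing is done.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """NFKC normalize, collapse whitespace runs, and strip both ends."""
    # NFKC folds compatibility characters, e.g. full-width digits to ASCII.
    txt = unicodedata.normalize('NFKC', txt)
    # Replace every run of whitespace with a single space.
    txt = re.sub(r'\s+', ' ', txt)
    return txt.strip()
```

With this sketch, norm_sketch('  １２３\t ４５  ') returns '123 45': the full-width digits become ASCII digits, the tab and space between the groups collapse to one space, and the padding at both ends is removed.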