lmp.dset._ch_poem
Chinese poetry dataset.
- class lmp.dset._ch_poem.ChPoemDset(*, ver: Optional[str] = None)
Bases: BaseDset
Poems of ancient Chinese dynasties.
See https://github.com/Werneror/Poetry for details on the dataset. See https://github.com/ProFatXuanAll/demo-dataset for details on dataset preprocessing.
Here we list some dataset statistics.
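| dynasty | number of poems | number of authors |
| --- | --- | --- |
| 宋 | 287114 | 9446 |
| 明 | 236957 | 4439 |
| 清 | 90089 | 8872 |
| 唐 | 49195 | 2736 |
| 元 | 37375 | 1209 |
| 近現代 | 28419 | 790 |
| 當代 | 28219 | 177 |
| 明末清初 | 17700 | 176 |
| 元末明初 | 15736 | 79 |
| 清末民國初 | 15367 | 99 |
| 清末近現代初 | 12464 | 48 |
| 宋末元初 | 12058 | 41 |
| 南北朝 | 4586 | 434 |
| 近現代末當代初 | 3426 | 23 |
| 魏晉 | 3020 | 251 |
| 金末元初 | 3019 | 17 |
| 金 | 2741 | 253 |
| 民國末當代初 | 1948 | 9 |
| 隋 | 1170 | 84 |
| 唐末宋初 | 1118 | 44 |
| 先秦 | 570 | 8 |
| 隋末唐初 | 472 | 40 |
| 漢 | 363 | 83 |
| 宋末金初 | 234 | 9 |
| 遼 | 22 | 7 |
| 秦 | 2 | 2 |
| 魏晉末南北朝初 | 1 | 1 |
| total | 853385 | 29377 |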
- Parameters
ver (Optional[str], default: None) – Version of the dataset. Set to None to use the default version self.__class__.df_ver.
- vers
All available versions of the dataset. Each version is named after the dynasty or period in which its poems appeared:
元, 元末明初, 先秦, 南北朝, 唐, 唐末宋初, 宋, 宋末元初, 宋末金初, 明, 明末清初, 民國末當代初, 清, 清末民國初, 清末近現代初, 漢, 當代, 秦, 近現代, 近現代末當代初, 遼, 金, 金末元初, 隋, 隋末唐初, 魏晉, 魏晉末南北朝初.
See also
- lmp.dset
All available datasets.
Examples
>>> from lmp.dset import ChPoemDset
>>> dset = ChPoemDset(ver='唐')
>>> dset[0][:10]
'風淅淅。夜雨連雲黑。'
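The version handling described for ver and vers can be illustrated with a minimal sketch. The helper resolve_ver below is hypothetical and not part of lmp; the default version passed as df_ver and the ValueError raised for unknown names are assumptions made only for illustration.

```python
from typing import Optional, Sequence


def resolve_ver(ver: Optional[str], vers: Sequence[str], df_ver: str) -> str:
    """Pick the version to load, falling back to the default when ver is None.

    Hypothetical helper: lmp may validate versions differently.
    """
    if ver is None:
        # None means "use the default version".
        return df_ver
    if ver not in vers:
        # Assumed behavior: reject names that are not in the version list.
        raise ValueError(f'unknown version {ver!r}; expected one of {list(vers)}')
    return ver


# A few of the version names listed above, used here only for illustration.
some_vers = ['唐', '宋', '元', '明', '清']

# The default value '唐' is an assumption, not taken from the library.
print(resolve_ver(None, some_vers, df_ver='唐'))
print(resolve_ver('宋', some_vers, df_ver='唐'))
```

Keeping the fallback in one place means callers can pass None everywhere and still get a consistent default.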
- classmethod download_dataset(ver: str) → None
Download Chinese poem dataset.
Download the zip file from GitHub and extract the raw file from it. The raw file is named ver.csv, where ver is the version of the dataset. The zip file is deleted after the raw file is extracted.
- Parameters
ver (str) – Version of the dataset.
- Return type
None
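The download, extract, and delete steps described above could look roughly like the following sketch. It is not the library's implementation: the function names, the url argument, the zip file name, and the assumption that the archive holds a member named exactly ver.csv at its root are all hypothetical.

```python
import os
import urllib.request
import zipfile


def extract_raw_file(zip_path: str, ver: str, out_dir: str) -> str:
    """Extract `<ver>.csv` from the archive, delete the archive, return the CSV path.

    Assumes the archive contains a member named exactly `<ver>.csv`.
    """
    member = f'{ver}.csv'
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extract(member, path=out_dir)
    # The zip file is no longer needed once the raw file is on disk.
    os.remove(zip_path)
    return os.path.join(out_dir, member)


def download_and_extract(url: str, ver: str, out_dir: str) -> str:
    """Download a zip archive from `url` (hypothetical), then extract and clean up."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, f'{ver}.csv.zip')
    urllib.request.urlretrieve(url, zip_path)
    return extract_raw_file(zip_path, ver, out_dir)
```

Splitting the extraction step from the network call keeps the file handling testable without internet access.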
- static norm(txt: str) → str
Text normalization.
Text is NFKC normalized. Whitespace is collapsed and stripped from both ends.
See also
- unicodedata.normalize
Python built-in Unicode normalization.
Examples
>>> from lmp.dset import BaseDset
>>> BaseDset.norm('123456789')
'123456789'
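The normalization described above (NFKC, then whitespace collapsing and stripping) can be approximated with the standard library alone. The function norm_sketch is a hypothetical stand-in, and the exact whitespace rule used by lmp is an assumption here.

```python
import re
import unicodedata


def norm_sketch(txt: str) -> str:
    """Approximate the described normalization; not the library's own code."""
    # NFKC folds compatibility characters, e.g. full-width digits to ASCII digits.
    txt = unicodedata.normalize('NFKC', txt)
    # Collapse every run of whitespace into a single space (assumed rule).
    txt = re.sub(r'\s+', ' ', txt)
    # Remove leading and trailing whitespace.
    return txt.strip()


print(norm_sketch('１２３４５６７８９'))  # 123456789
print(norm_sketch('  hello \t\n world  '))  # hello world
```

Running NFKC first matters because it turns characters such as the ideographic space (U+3000) into an ordinary space, which the later steps then collapse and strip.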