<!DOCTYPE html>
<html lang="yue">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta
name="description"
content="張悦楷講古語音數據集 - The Zoeng Jyut Gaai Storytelling Voice Dataset. A high-quality, artistic and expressive Cantonese speech dataset for TTS, ASR, LLM, and linguistic analysis."
/>
<meta
name="keywords"
content="Cantonese, linguistics, TTS, ASR, LLM, speech synthesis, speech recognition, language model, Yue Chinese, Cantonese Chinese, 粵語, 廣州話, 白話, 廣東話, 張悦楷, 講古語音, 講古佬, 講古, 語音數據集, speech dataset, open-source, public domain, CC0"
/>
<meta name="author" content="張悦楷" />
<meta
property="og:title"
content="張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset"
/>
<meta
property="og:description"
content="開源粵語語音數據集,適合語音識別、語音合成、大語言模型、語言學文學研究等應用 Open-source Cantonese voice dataset for ASR, TTS, LLMs, linguistics research, and more"
/>
<meta
property="og:image"
content="https://canclid.github.io/zoengjyutgaai/zoengjyutgaai.webp"
/>
<meta
property="og:url"
content="https://canclid.github.io/zoengjyutgaai/"
/>
<meta property="og:type" content="website" />
<meta name="twitter:card" content="summary_large_image" />
<link rel="icon" href="zoengjyutgaai.webp" type="image/webp" />
<title>張悦楷講古語音數據集</title>
<script src="https://cdn.tailwindcss.com"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css"
/>
<script>
tailwind.config = {
theme: {
extend: {
fontFamily: {
kumincho: ["KuMincho", "serif"],
},
},
},
};
</script>
<link href="styles.css" rel="stylesheet" />
</head>
<body class="bg-white">
<div class="container mx-auto my-16 px-4 py-8 max-w-7xl">
<!-- Header -->
<header class="text-center mb-16">
<h1 class="text-5xl text-black mb-4 font-kumincho">
<span class="block mb-8">張悦楷講古語音數據集</span>
The Zoeng Jyut Gaai Storytelling Voice Dataset
</h1>
<p class="text-2xl text-gray-500 my-12">
<span class="block mb-4"
>開源粵語語音數據集,適合語音識別、語音合成、大語言模型、語言學文學研究等應用</span
>
Open-source Cantonese voice dataset for ASR, TTS, LLMs, linguistics
research, and more
</p>
</header>
<!-- Main Content -->
<main class="grid grid-cols-1 sm:grid-cols-3 gap-0">
<!-- Dataset Description -->
<section class="col-span-1 bg-white rounded-lg">
<div class="bg-white rounded-lg text-center my-12">
<h3 class="text-2xl text-gray-500">
授權許可 <br />
License
</h3>
<p class="text-lg font-semibold m-4 text-black">
CC0 公共領域 <br />
Public Domain
</p>
</div>
<div class="bg-white rounded-lg text-center mt-4 mb-12">
<h3 class="text-2xl text-gray-500">
語言 <br />
Language
</h3>
<p class="text-lg font-semibold m-4 text-black">
粵語 <br />
Cantonese <br />
ISO 639-3: <code>yue</code>
</p>
</div>
<div class="bg-white rounded-lg text-center my-12">
<h3 class="text-2xl text-gray-500">
總時長 <br />
Total Duration
</h3>
<p class="text-lg font-semibold m-4 text-black">
66.01 個鐘 hours <br />(3960.73 分鐘 minutes)
</p>
</div>
<div class="bg-white rounded-lg text-center my-12">
<h3 class="text-2xl text-gray-500">
總字數(含標點) <br />
Total Characters (including punctuation)
</h3>
<p class="text-lg font-semibold m-4 text-black">946176</p>
</div>
<div class="bg-white rounded-lg text-center my-12">
<h3 class="text-2xl text-gray-500">
發音人 <br />
Voice Actor
</h3>
<p class="text-lg font-semibold m-4 text-black">張悦楷</p>
</div>
</section>
<!-- Dataset Stats -->
<section class="col-span-2 mb-12">
<h2 class="text-3xl my-8">介紹 Introduction</h2>
<p class="text-gray-700 text-xl mb-4">
本數據集由廣州最出名嘅話劇演員、説書藝人(講古佬)張悦楷喺 1980
年代電台播講《三國演義》嘅錄音製成。數據集所有文本均由人工轉寫,並根據《三國演義》原文校對嚟確保準確性。
</p>
<p class="text-gray-700 text-xl my-4">
This dataset is built from radio recordings of Zoeng Jyut Gaai, the
most famous stage actor and storyteller in Canton, performing
<em>Romance of the Three Kingdoms</em> in the 1980s. All text in the
dataset was transcribed manually and proofread against the original
text of <em>Romance of the Three Kingdoms</em> to ensure accuracy.
</p>
<p class="text-gray-700 text-xl my-4">
本數據集可用於各種用途,例如語音合成(TTS)、語音識別(ASR)、語言模型(LLM)、語言學分析等等。<a
href="https://huggingface.co/spaces/laubonghaudoi/zoengjyutgaai_tts"
class="underline"
>
張悦楷語音合成 </a
>就係一個用本數據集訓練出嚟嘅 TTS 系統。
</p>
<p class="text-gray-700 text-xl my-4">
This dataset serves many purposes, including text-to-speech (TTS),
automatic speech recognition (ASR), language modeling, and
linguistic analysis. As an example,
<a
href="https://huggingface.co/spaces/laubonghaudoi/zoengjyutgaai_tts"
class="underline"
>
張悦楷語音合成
</a>
is a TTS system trained on this dataset.
</p>
<h2 class="text-3xl my-12">數據樣例 Data Samples</h2>
<div class="px-8 py-4 mb-8 border-solid border-black border-2">
<div class="my-4">
<audio controls class="w-full">
<source src="029_201.wav" type="audio/wav" />
瀏覽器唔支援音頻
</audio>
<p class="text-xl my-4 text-gray-700">
當今天下嘅英雄,就係使君你,同我喇。
</p>
</div>
<div class="my-8">
<audio controls class="w-full">
<source src="074_222.wav" type="audio/wav" />
瀏覽器唔支援音頻
</audio>
<p class="text-xl my-4 text-gray-700">
唉!既生瑜,何生亮!既生瑜,何生亮!既生瑜,何生亮啊!
</p>
</div>
<div class="my-4">
<audio controls class="w-full">
<source src="121_097.wav" type="audio/wav" />
瀏覽器唔支援音頻
</audio>
<p class="text-xl my-4 text-gray-700">
王朗講完,孔明喺架車上哈哈大笑佢話:哈哈哈哈哈哈哈哈,我仲以為堂堂漢朝嘅大老元臣,所講嘅道理必定十分高明嘅,點估到竟然如此卑鄙啊!
</p>
</div>
</div>
<h2 class="text-3xl my-16">下載 Download</h2>
<div class="flex justify-center my-16">
<a
href="https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji"
target="_blank"
class="bg-yellow-300 text-black text-xl px-8 py-4 hover:bg-black hover:text-white transition-colors"
>
前往 🤗 Hugging Face 下載
</a>
</div>
<p class="text-gray-700 text-xl">
如果你想單純克隆所有 wav 文件,可以用下面嘅命令嚟凈係克隆個
<code>wav/</code> 路徑,避免 clone 晒成個 repo:
</p>
<p class="text-gray-700 text-xl my-4">
To download only the wav files without cloning the entire repo, use
the following commands, which check out the
<code>wav/</code> directory only:
</p>
<pre
class="text-nowrap p-4 bg-gray-100 overflow-auto my-4"
><code>mkdir zoengjyutgaai_saamgwokjinji
cd zoengjyutgaai_saamgwokjinji
git init
git remote add origin https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji
git sparse-checkout init --cone
# 指定凈係下載個別路徑 Tell git which directory you want
git sparse-checkout set wav
# 開始下載 Pull the content
git pull origin main</code></pre>
<p class="text-gray-700 text-xl my-4">
所有文字轉寫都喺 <code>wav/metadata.csv</code>入面。
</p>
<p class="text-gray-700 text-xl my-4">
All text transcriptions are in
<code>wav/metadata.csv</code>.
</p>
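<p class="text-gray-700 text-xl my-4">
As a minimal sketch for a first look at the transcriptions, the Python
snippet below reads the first few rows of
<code>wav/metadata.csv</code> with the standard <code>csv</code> module.
The helper name <code>preview_metadata</code>, the UTF-8 encoding, and the
comma delimiter are assumptions. The column layout is not described on this
page, so rows are returned as plain lists rather than named fields.
</p>
<pre class="p-4 bg-gray-100 overflow-auto my-4">
```python
import csv
from itertools import islice


def preview_metadata(path="wav/metadata.csv", limit=5):
    """Return the first `limit` rows of the CSV file as lists of strings."""
    # Assumption: the file is UTF-8 encoded and comma-separated.
    with open(path, newline="", encoding="utf-8") as f:
        return list(islice(csv.reader(f), limit))
```
</pre>
<p class="text-gray-700 text-xl my-4">
Calling <code>preview_metadata()</code> from the directory that contains
<code>wav/</code> returns up to five rows. Pass a different
<code>path</code> if the files live elsewhere.
</p>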
<h2 class="text-3xl my-12">説明 Info</h2>
<p class="text-xl mb-4">
所有源字幕 SRT 文件都存放喺 Hugging Face
倉庫嘅<code>srt/</code>路徑下。所有源音頻都以 .webm 格式放喺
<code>webm/</code> 路徑下。
</p>
<p class="text-xl my-4">
All source subtitle (SRT) files are stored in the
<code>srt/</code> directory of the Hugging Face repository. All
source audio is stored in .webm format in the
<code>webm/</code> directory.
</p>
<ul class="text-xl my-4 px-4 list-disc">
<li>
所有文本都根據
<a href="https://jyutping.org/blog/typo/" class="underline"
>jyutping.org/blog/typo</a
>
同
<a href="https://jyutping.org/blog/particles/" class="underline">
jyutping.org/blog/particles/
</a>
規範用字
</li>
<li>所有文本都使用全角標點,冇半角標點</li>
<li>所有文本都用漢字轉寫,無阿拉伯數字無英文字母</li>
<li>所有音頻源都存放喺<code>webm/</code>下面</li>
</ul>
<ul class="text-xl my-4 px-4 list-disc">
<li>
All text is standardized according to the orthography in
<a href="https://jyutping.org/blog/typo/" class="underline"
>jyutping.org/blog/typo</a
>
and
<a
href="https://jyutping.org/blog/particles/"
class="underline"
>
jyutping.org/blog/particles/
</a>
</li>
<li>
All text uses full-width punctuation, with no half-width
punctuation.
</li>
<li>
All text is written in Chinese characters, with no Latin letters or
Arabic numerals.
</li>
<li>
All source audio is stored in
<code>webm/</code>.
</li>
</ul>
<h2 class="text-3xl my-12">數據統計 Statistics</h2>
<table
class="table-auto w-full my-8 border-2 border-black border-collapse"
>
<thead></thead>
<tbody>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
總時長 Total Duration
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
66.01 個鐘 hours(3960.73 分鐘 minutes)
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
平均音頻時長 Average Clip Duration
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
6.065 秒 seconds
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
中位音頻時長 Median Clip Duration
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
5.606 秒 seconds
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
最短音頻時長 Min Clip Duration
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
0.339 秒 seconds
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
最長音頻時長 Max Clip Duration
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
31.822 秒 seconds
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
平均每句字數(含標點) Average Characters Per Clip (including
punctuation)
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
24.00
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
中位每句字數(含標點) Median Characters Per Clip (including
punctuation)
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
23
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
文本總字數(含標點) Total Characters (including
punctuation)
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
946176
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
覆蓋漢字數 Unique Chinese Characters Covered
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
3988
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
平均語速(含標點) Average Speaking Rate (including
punctuation)
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
3.98 字/秒 characters per second
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
採樣率 Sampling Rate
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
44100 Hz
</td>
</tr>
<tr>
<td class="border px-4 py-2 border-black border-0 text-lg">
音頻文件格式 Audio File Format
</td>
<td class="border px-4 py-2 border-black border-0 text-lg">
.wav
</td>
</tr>
</tbody>
</table>
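<p class="text-gray-700 text-xl my-4">
As a quick sanity check, the total duration in hours and the average
speaking rate can be recomputed from two other figures in the table. The
sketch below assumes the speaking rate is defined as total characters
divided by total duration, which this page does not state explicitly. The
function names are illustrative only.
</p>
<pre class="p-4 bg-gray-100 overflow-auto my-4">
```python
# Figures copied from the statistics table above.
TOTAL_CHARACTERS = 946176  # including punctuation
TOTAL_MINUTES = 3960.73


def total_hours(minutes):
    """Convert a duration in minutes to hours."""
    return minutes / 60


def speaking_rate(characters, minutes):
    """Characters per second over the whole corpus (assumed definition)."""
    return characters / (minutes * 60)


hours = total_hours(TOTAL_MINUTES)
rate = speaking_rate(TOTAL_CHARACTERS, TOTAL_MINUTES)
print(f"{hours:.2f} hours, {rate:.2f} characters per second")
# prints "66.01 hours, 3.98 characters per second"
```
</pre>
<p class="text-gray-700 text-xl my-4">
Both values match the table when rounded to two decimal places, which
suggests the reported rate is a corpus-wide ratio rather than a mean of
per-clip rates.
</p>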
<h2 class="text-3xl my-12">引用 Citation</h2>
<p class="text-gray-700 text-xl">
本數據集屬公共領域,遵循
<a href="https://creativecommons.org/public-domain/cc0/">CC0</a>
許可聲明。即係話你可以無需授權免費任用本數據集,亦都唔需要註明出處。不過如果你用咗本數據集,我哋都希望你可以引用本頁面,作為對楷叔嘅懷念同致敬:
</p>
<p class="text-gray-700 text-xl my-4">
This dataset is in the public domain under the
<a href="https://creativecommons.org/public-domain/cc0/">CC0</a>
dedication. You may use it freely, without asking permission and
without attribution. If you do use it, however, we hope you will
cite this page as a tribute to Gaai Suk:
</p>
<pre class="bg-gray-100 p-4 rounded-lg overflow-x-auto">
@misc{zoengjyutgaai2025,
title={張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset},
author={粵語計算語言學基礎建設組 Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
year={2025}
}</pre
>
<h2 class="text-3xl my-12">意見反饋 Feedback</h2>
<p class="text-gray-700 text-xl my-4">
數據集建設難免有疏漏,如果你發現有任何錯誤、問題,或者有任何意見,歡迎喺
<a
class="underline"
href="https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji/discussions"
>
Hugging Face 討論區 </a
>提出。
</p>
<p class="text-gray-700 text-xl my-4">
Oversights are hard to avoid when building a dataset. If you find any
errors or problems, or have any suggestions, feel free to raise them in the
<a
class="underline"
href="https://huggingface.co/datasets/CanCLID/zoengjyutgaai_saamgwokjinji/discussions"
>
Hugging Face discussion forum</a
>.
</p>
</section>
</main>
<!-- Footer -->
<footer class="mt-12 text-center">
<img
src="zoengjyutgaai.webp"
alt="張悦楷"
class="w-64 h-64 rounded-full mx-auto mb-16 object-cover"
/>
<div class="flex flex-col items-center gap-4">
<div class="flex items-center gap-4">
<a
href="https://github.com/CanCLID"
target="_blank"
class="text-gray-600 hover:text-black"
title="GitHub"
>
<i class="fab fa-github text-2xl"></i>
</a>
<a
href="https://twitter.com/Can_CLID"
target="_blank"
class="text-gray-600 hover:text-black"
title="Twitter"
>
<i class="fab fa-twitter text-2xl"></i>
</a>
</div>
<p class="text-gray-600 text-lg">
粵語計算語言學基礎建設組 Cantonese Computational Linguistics
Infrastructure Development Workgroup (CanCLID)
</p>
</div>
</footer>
</div>
</body>
</html>