Wikitext-format dataset of Namuwiki (the most famous Korean-language wiki)

lovit/namuwikitext

Namuwikitext

Wikitext-format Korean corpus

This is a set of wikitext-format text files built from the Namuwiki dump data. For training and evaluation, the corpus is split by wiki page into train (99%), dev (0.5%), and test (0.5%) sets.
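As a rough illustration of what working with a wikitext-style file can look like, the sketch below splits a stream of lines into documents. It assumes the WikiText convention in which a page starts with a top-level heading line such as ` = Title = ` and subsections use ` = = Section = = `. That convention, the `split_documents` helper, and the sample lines are all assumptions made for illustration; check the downloaded files before relying on it.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumed top-level heading pattern: " = Title = ".
# The negative lookahead skips subsection headings such as " = = Section = = ".
TOP_HEADING = re.compile(r"^\s*=\s+(?!=)(.+?)\s+=\s*$")


def split_documents(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, body_lines) pairs, one per top-level heading (hypothetical helper)."""
    title = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = TOP_HEADING.match(line)
        if match:
            # A new page starts: emit the previous one, if any.
            if title is not None:
                yield title, body
            title, body = match.group(1), []
        elif title is not None and line.strip():
            # Keep non-empty lines, including subsection headings, in the body.
            body.append(line.strip())
    if title is not None:
        yield title, body


# Made-up sample lines, not taken from the corpus.
sample = [
    " = Seoul = \n",
    " Seoul is the capital of South Korea . \n",
    " = = History = = \n",
    " The city has a long history . \n",
    " \n",
    " = Busan = \n",
    " Busan is a port city . \n",
]
docs = list(split_documents(sample))
for doc_title, doc_body in docs:
    print(doc_title, len(doc_body))
```

Because the function consumes any iterable of lines, an open file object can be passed directly, so the 4.6G train file would not need to be read into memory at once.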

Corpus size

  • train: 31235096 lines (500104 docs, 4.6G)
  • dev: 153605 lines (2525 docs, 23M)
  • test: 160233 lines (2527 docs, 24M)
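As a quick sanity check, the document counts listed above can be compared against the stated 99% / 0.5% / 0.5% split. This is a minimal sketch that uses only the numbers from the list; the variable names are illustrative.

```python
# Document and line counts copied from the corpus size list above.
doc_counts = {"train": 500104, "dev": 2525, "test": 2527}
line_counts = {"train": 31235096, "dev": 153605, "test": 160233}

total_docs = sum(doc_counts.values())

# Share of documents in each split, and average number of lines per document.
shares = {name: count / total_docs for name, count in doc_counts.items()}
lines_per_doc = {name: line_counts[name] / doc_counts[name] for name in doc_counts}

for name in doc_counts:
    print(f"{name}: {shares[name]:.2%} of docs, "
          f"{lines_per_doc[name]:.1f} lines per doc on average")
```

The shares come out very close to 99%, 0.5%, and 0.5%, consistent with a split made per wiki page rather than per line.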

To fetch the data, run the script below. The three corpus files (train / dev / test) are downloaded to ./data/

python fetch.py

This corpus is licensed under CC BY-NC-SA 2.0 KR, the same license as Namuwiki. For details, visit https://creativecommons.org/licenses/by-nc-sa/2.0/kr/

Fetch and load using Korpora

Korpora is an archive of Korean corpora implemented in Python. It provides fetch / load functions for this corpus.

This corpus is available through the Korpora project.

from Korpora import Korpora

namuwikitext = Korpora.load('namuwikitext')

# or, to download the corpus files without loading them
Korpora.fetch('namuwikitext')

License

CC BY-NC-SA 2.0 KR, the license under which the Namuwiki dump dataset is released.