[Corpus] Common crawl ko #184

Open
lovit opened this issue Nov 24, 2020 · 3 comments

@lovit
Member

lovit commented Nov 24, 2020

http://data.statmt.org/cc-100/

I will incorporate this into #187.

@lovit
Member Author

lovit commented Jan 24, 2021

  • The cc-100 corpus downloads rather slowly. Look into whether mirroring is possible.
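Until a mirror exists, one way to cope with a slow link is to make the download resumable, so an interrupted transfer does not start from zero. The following is only a sketch: the per-language URL pattern in `CC100_URL` is an assumption based on the index page linked above, both function names are hypothetical, and it assumes the server honors HTTP `Range` requests, which has not been verified here.

```python
import os
import urllib.request

# Assumed URL pattern for per-language files; check it against the cc-100 index page.
CC100_URL = "http://data.statmt.org/cc-100/{lang}.txt.xz"


def build_resume_request(url, dest_path):
    """Return (request, offset), asking only for the bytes not yet on disk."""
    request = urllib.request.Request(url)
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if offset > 0:
        # Ask the server to skip the part that was already downloaded.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Append the remaining bytes of `url` to `dest_path`.

    If the file on disk is already complete, the server may answer 416 and
    urlopen raises urllib.error.HTTPError; this sketch does not handle that.
    """
    request, offset = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the Range header was honored; otherwise start over.
        mode = "ab" if offset > 0 and response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling `download_with_resume(CC100_URL.format(lang="ko"), "ko.txt.xz")` again after an interruption would then continue from the current file size rather than from the beginning.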

@lovit
Member Author

lovit commented Jan 24, 2021

The cc-100 data is distributed as LANG.txt.xz files, and the Python lzma package is used to unpack the xz files. When using pyenv, the following error can occur.

from _lzma import *

ModuleNotFoundError: No module named '_lzma' 

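This is a pyenv issue rather than a Python one: the interpreter was built without the xz (liblzma) development files. Install them and rebuild Python as follows.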

(macOS)

pyenv uninstall x.y.z
brew install xz
pyenv install x.y.z

(Ubuntu)

pyenv uninstall x.y.z
sudo apt-get install liblzma-dev
pyenv install x.y.z
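After the reinstall, `import lzma` should succeed. With that in place, the .xz file can be read line by line without unpacking it to disk first. This is a minimal sketch using only the standard library; the helper name `iter_xz_lines` is hypothetical, and UTF-8 encoding is an assumption about the corpus file.

```python
import lzma  # fails with ModuleNotFoundError if Python was built without liblzma


def iter_xz_lines(path, encoding="utf-8"):
    """Yield lines from an .xz text file without decompressing it to disk."""
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            # Drop only the trailing newline; keep other whitespace intact.
            yield line.rstrip("\n")
```

For example, `for line in iter_xz_lines("ko.txt.xz"): ...` streams the corpus, which avoids keeping a second, uncompressed copy of the data on disk.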

@lovit
Member Author

lovit commented Jan 24, 2021

  • Data statistics
| file | size | num lines | num words | num characters |
| --- | --- | --- | --- | --- |
| ko.txt.xz | 13G | - | - | - |
| ko.txt | 54G | 390,127,563 | 6,865,713,849 | 58,150,167,722 |
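The thread does not say how these numbers were computed. As a rough sketch, similar counts can be gathered in one streaming pass over the compressed file. The function name `corpus_stats` is hypothetical, and defining a word as a whitespace-separated token is an assumption. The sketch reports characters and UTF-8 bytes separately because the character count above is close to the 54G file size expressed in bytes, so it may be a byte count; that is a guess, not something the thread confirms.

```python
import lzma


def corpus_stats(path):
    """Count lines, whitespace-separated words, characters, and UTF-8 bytes."""
    stats = {"lines": 0, "words": 0, "characters": 0, "bytes": 0}
    # Read in binary mode so that bytes and characters can both be counted.
    with lzma.open(path, mode="rb") as handle:
        for raw in handle:
            text = raw.decode("utf-8")
            stats["lines"] += 1
            stats["words"] += len(text.split())
            stats["characters"] += len(text)  # includes the newline
            stats["bytes"] += len(raw)
    return stats
```

On a file of this size a single pass takes a while, so it may be worth storing the result next to the corpus instead of recomputing it.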

lovit added a commit that referenced this issue Jan 24, 2021
lovit added a commit that referenced this issue Jan 24, 2021
lovit added a commit that referenced this issue Jan 27, 2021
lovit added a commit that referenced this issue Jan 27, 2021