Modu Corpus (모두의 말뭉치) loader #103

lovit · 2020-10-04T11:49:44Z

Since there is no fetch functionality, we create Loader classes in korpora_modu.py that include load functionality.
Each class name follows the form ModuXXXKorpus (e.g., ModuNewsKorpus).

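Corpus type	ongoing	finished	note
Newspaper corpus	x	x
Written corpus	x	x
Spoken corpus	x	x
Messenger corpus	x	x
Web corpus	x	x
Document summarization corpus			Not supported in `Korpora==0.2.0` due to a mash-up issue (#124)
Morphological analysis corpus	x	x
Word sense analysis corpus			Not supported in `Korpora==0.2.0` due to a consistency issue (#123); see the detailed comment
Named entity analysis corpus	x	x
Syntactic analysis corpus			Not supported in `Korpora==0.2.0` because how to provide the tree structure still needs discussion
Grammaticality judgment corpus			tsv files are not supported in `Korpora==0.2.0`
Similar sentence corpus			Not supported in `Korpora==0.2.0` due to a mash-up issue (#124)
Lexical relation data			tsv files are not supported in `Korpora==0.2.0`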


lovit · 2020-10-04T17:55:47Z

en	ko	File format
NIKL_CoLA(v1.0)	국립국어원 문법성 판단 말뭉치 (버전 1.0)	Provided as 4 tsv files
NIKL_DP(v1.0)	국립국어원 구문 분석 말뭉치 (버전 1.0)	Provided as a single file, `NXDP1902008051.json`
NIKL_LS(v1.0)	국립국어원 어휘 의미 분석 말뭉치 (버전 1.0)	Provided as two files, `NXLS1902008050.json` and `SXLS1902008030.json`
NIKL_MESSENGER(v1.0)	국립국어원 메신저 말뭉치 (버전 1.0)	Provided as `MDRWxxx.json` and `MMRWxxx.json` files
NIKL_MP(v1.0)	국립국어원 형태 분석 말뭉치 (버전 1.0)	Provided as two files, `NXMP1902008040.json` and `SXMP1902008031.json`
NIKL_NE(v1.0)	국립국어원 개체명 분석 말뭉치 (버전 1.0)	Provided as two files, `NXNE1902008030.json` and `SXNE1902007240.json`
NIKL_NEWSPAPER(v1.0)	국립국어원 신문 말뭉치 (버전 1.0)	Articles are provided in `N*RWxxx.json` files
NIKL_NIKLex(v1.0)	국립국어원 어휘 관계 자료 (버전 1.0)	Provided as 4 tsv files
NIKL_PARAPHRASE(v1.0)	국립국어원 유사 문장 말뭉치 (버전 1.0)	Provided as a single file, `NIKL_PC.json`
NIKL_SPOKEN(v1.0)	국립국어원 구어 말뭉치 (버전 1.0)	Provided as `SARWxxx.json`, `SBRWxxx.json`, `SDRWxxx.json`, and `SERWxxx.json` files
NIKL_SUMMARIZATION(v1.0)	국립국어원 문서 요약 말뭉치 (버전 1.0)	Provided as a single file, `NIKL_SC.json`
NIKL_WEB(v1.0)	국립국어원 웹 말뭉치 (버전 1.0)	Documents are provided in `EBRWxxx.json`, `EPRWxxx.json`, `ESRWxxx.json`, and `ERRWxxx.json` files
NIKL_WRITTEN(v1.0)	국립국어원 문어 말뭉치 (버전 1.0)	Provided as `WARWxxx.json`, `WBRWxxx.json`, `WCRWxxx.json`, and `WZRWxxx.json` files
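
The file-name patterns in the table lend themselves to simple glob matching. Below is a minimal sketch, assuming a corpus has already been unpacked into a local directory. `FILE_PATTERNS` and `find_corpus_files` are hypothetical names for illustration and are not part of Korpora, and the `xxx` placeholders from the table are treated as wildcards.

```python
from pathlib import Path

# Hypothetical mapping from corpus directory name to file-name patterns,
# transcribed from the table above ("xxx" is treated as a wildcard).
FILE_PATTERNS = {
    "NIKL_NEWSPAPER(v1.0)": ["N*RW*.json"],
    "NIKL_MESSENGER(v1.0)": ["MDRW*.json", "MMRW*.json"],
    "NIKL_SPOKEN(v1.0)": ["SARW*.json", "SBRW*.json", "SDRW*.json", "SERW*.json"],
    "NIKL_WEB(v1.0)": ["EBRW*.json", "EPRW*.json", "ESRW*.json", "ERRW*.json"],
    "NIKL_WRITTEN(v1.0)": ["WARW*.json", "WBRW*.json", "WCRW*.json", "WZRW*.json"],
}


def find_corpus_files(root_dir, corpus_dirname):
    """Return the sorted paths under root_dir/corpus_dirname that match the patterns."""
    corpus_dir = Path(root_dir) / corpus_dirname
    paths = set()
    for pattern in FILE_PATTERNS[corpus_dirname]:
        # Path.glob only interprets the pattern, so the parentheses in the
        # directory name need no escaping.
        paths.update(corpus_dir.glob(pattern))
    return sorted(paths)
```

Calling `find_corpus_files('path/to', 'NIKL_MESSENGER(v1.0)')` would then return only the `MDRW*` and `MMRW*` JSON files and ignore anything else in the directory.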

lovit · 2020-10-04T18:59:41Z

The commit above provides the following functionality.

from Korpora.korpus_modu import ModuNewsKorpus

news_corpus = ModuNewsKorpus(['path/to/NIKL_NEWSPAPER(v1.0)/NPRW1900000013.json'])
print(news_corpus.train)

모두의 말뭉치: 뉴스 말뭉치: size=13581
  - 모두의 말뭉치: 뉴스 말뭉치.name : list[str]
  - 모두의 말뭉치: 뉴스 말뭉치.texts : list[ModuNewsLight]
  - 모두의 말뭉치: 뉴스 말뭉치.news : list[ModuNewsLight]

news_corpus.train[0]

ModuNewsLight(document_id='NPRW1900000013.1', title='비앤티신문 2011년 기사', paragraph="'탈모' 치료하고 '여심' 잡는 방법?\n시끌벅적한 연말 분위기가 수그러들자 신묘년 새해가 밝았다. 사람들은 새해 첫날부터 의지에 불타 이런저런 계획을 세우기 바쁘다. 이렇게 활기가 넘치는 연휴에 황철민(33세, 가명) 씨는 혼자 방에 틀어박혀 텔레비전을 벗 삼아 울적한 기분을 달래고 있다.\n2년 가까이 사귄 애인에게 올해 1월1일 00시에 청혼하기로 마음먹은 그는 설레는 마음으로 이벤트를 준비했다. 기대가 크면 실망은 더 큰 법. 그는 승낙을 얻어 내기는커녕 이별을 통보받은 것이다.\n“재작년인가… 잘 기억은 안 나지만 서른이 지나고부터 탈모가 시작된 것 같습니다. 머리숱도 많고 모발도 굵은 편이라서 평소보다 조금 많이 빠져도 별 신경을 쓰지 않았습니다. 6~7개월 전까지만 해도 지금처럼 정수리가 드러날 줄은 꿈에도 몰랐거든요. 여자 친구가 별말이 없어서 고마워하고 있었는데 헤어지던 날 대머리가 될까 봐 결혼이 싫다는 말을 듣게 됐습니다”라고 말했다.\n남성 탈모는 정수리부터 머리가 빠지기 시작해 대머리가 된다. 반면 여성 탈모는 전체적으로 숱이 줄어든다는 특징이 있다. 하지만 남녀노소를 불문하고 탈모는 스트레스의 원인이자 콤플렉스로 작용한다.\n특히 남성 탈모 환자는 실제보다 더 나이가 들어 보이고, 젊은 사람은 사회생활에 어려움을 주기도 한다. 심각한 경우에는 자신감을 잃고 우울한 기분에 빠질 수도 있다. 때문에 평소 꼼꼼한 두피 관리를 해주는 것이 중요하다.\n머리를 감을 때는 두피에 손상을 주지 않도록 손가락으로 마사지하듯 샴푸하고 헤어제품의 찌꺼기가 남지 않도록 충분히 헹궈준다. 헤어드라이나 헤어스트레이터의 사용은 자제하고 요즘 같이 추운 겨울에는 두피도 건조해지므로 보습관리를 해줘야 한다. 또 계란, 검은콩, 검은깨, 흑미 등 미네랄과 단백질을 많이 섭취하는 것도 도움이 된다.\n탈모가 시작됐더라도 진행을 늦춰주거나 상태를 호전시킬 수 있는 치료를 빨리 시작하는 것이 좋다. 머리가 빠지기 시작했다고 포기하고 내버려두면 상태가 더 악화될 수 있다. 고주파나 먹는 약, 바르는 약 등의 복합적인 치료가 필요하다.\n태전약품의 '드로젠 정'은 약사들이 추천하는 명약에 선정된 만큼 믿고 안심할 수 있다. 또한 간편한 복용만으로도 탈모를 예방하고 치료하는 효과가 있어 사람들이 선호하는 제품이기도 하다.\n드로젠 정은 탈모증 치료제로 양약과 감초와 같은 생약성분이 혼합되어 있어 탈모를 예방하고 모발의 성장을 촉진시켜준다. 비타민 성분이 두피의 말초혈관에 작용하여 모발에 산소와 영양분을 공급하여 탈모를 예방하고, 건강한 머리를 유지시켜 주는 것으로 알려져 있다.\n다른 탈모제와 비교했을 때, 마이녹실(minoxidil)이나 프로페시아(finasteride) 등 여성에게 부작용을 일으킬 수 있는 성분을 함유하고 있지 않아, 원형 탈모증, 비강성 탈모증 등의 여성 탈모 증상에 효과적이다. 3~6개월 꾸준히 영양제처럼 장기복용 할 경우 탈모방지에 더욱 효과가 좋다고 한다.")
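
To make the shape of the record above concrete, here is a minimal sketch of how one parsed JSON document could be flattened into such a record. `NewsRecord` and `document_to_record` are hypothetical stand-ins, not the actual Korpora implementation, and the JSON key names (`id`, `metadata`, `title`, `paragraph`, `form`) are assumptions that should be checked against the real corpus files.

```python
from dataclasses import dataclass


@dataclass
class NewsRecord:
    """Illustrative stand-in for the ModuNewsLight record shown above."""
    document_id: str
    title: str
    paragraph: str


def document_to_record(document):
    """Flatten one parsed JSON document into a NewsRecord.

    ASSUMPTION: the key names used here ("id", "metadata", "title",
    "paragraph", "form") are guesses for illustration only.
    """
    paragraphs = [p["form"] for p in document["paragraph"]]
    return NewsRecord(
        document_id=document["id"],
        title=document["metadata"]["title"],
        # Join paragraphs with newlines, matching the "\n" separators
        # visible in the example output.
        paragraph="\n".join(paragraphs),
    )
```

Keeping only three plain string fields drops the rest of the JSON metadata, which is the kind of trimming the "Light" suffix in the record name suggests.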

lovit · 2020-10-04T19:10:13Z

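The commit above and the comment summarizing the file formats raise three points for discussion.

1. The newspaper corpus alone took about 70 lines of code. I am not sure that putting all 13 corpora, the newspaper corpus included, into a single file is a good choice.
2. Most of the 13 corpora are in JSON, but some are tsv. The keys inside the JSON files also differ from corpus to corpus. For this release, how about providing loaders for only some of the Modu corpora?
3. Because the data is JSON, it carries a lot of unnecessary information. Korpora's basic principle is to load the original data without modifying it, but for practical use I think providing a script that cleans the corpora would be more convenient. In particular, the document summarization corpus requires looking up each source document in the newspaper corpus, and opening the entire newspaper corpus to search for source documents every time the summarization corpus is loaded is very inefficient (the newspaper corpus is about 16 GB uncompressed). I think we should keep to the principle of not modifying the original data as far as possible and, if needed, provide Modu corpus cleaning scripts in a repository independent of the ko-nlp team. For these reasons, as mentioned in item 2, I propose providing loaders only for the corpora that do not need a cleaning script.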
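
One way to avoid scanning the whole newspaper corpus for a source document is to derive the file name from the document id. The sketch below assumes that a document id is the file stem followed by a dot and an index, as in the single example in this thread (`NPRW1900000013.1` in `NPRW1900000013.json`). Both function names are hypothetical, and the assumption would need to be verified across the corpus.

```python
from pathlib import Path


def document_id_to_filename(document_id):
    """Map 'NPRW1900000013.1' to 'NPRW1900000013.json'.

    ASSUMPTION: a document id is the file stem, a dot, and an index,
    as in the single example shown in this thread.
    """
    file_stem, _, _ = document_id.rpartition(".")
    if not file_stem:
        raise ValueError(f"unexpected document id: {document_id!r}")
    return f"{file_stem}.json"


def locate_source_file(news_dir, document_id):
    """Return the path of the one newspaper file expected to hold document_id."""
    return Path(news_dir) / document_id_to_filename(document_id)
```

With this, loading a summary would open only one newspaper file per source document. It would still cross-reference two corpora, so it does not remove the mash-up concern raised in item 3.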

ratsgo · 2020-10-05T00:52:14Z

> The newspaper corpus alone took about 70 lines of code. I am not sure that putting all 13 corpora, the newspaper corpus included, into a single file is a good choice.

The corpus structure is not what I had expected. Yes, I also think it would be best to develop this the same way as the existing corpora, with one file handling one corpus.

ratsgo · 2020-10-05T00:52:45Z

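> Most of the 13 corpora are in JSON, but some are tsv. The keys inside the JSON files also differ from corpus to corpus. For this release, how about providing loaders for only some of the Modu corpora?

> Because the data is JSON, it carries a lot of unnecessary information. Korpora's basic principle is to load the original data without modifying it, but for practical use I think providing a script that cleans the corpora would be more convenient. In particular, the document summarization corpus requires looking up each source document in the newspaper corpus, and opening the entire newspaper corpus to search for source documents every time the summarization corpus is loaded is very inefficient (the newspaper corpus is about 16 GB uncompressed). I think we should keep to the principle of not modifying the original data as far as possible and, if needed, provide Modu corpus cleaning scripts in a repository independent of the ko-nlp team. For these reasons, as mentioned in item 2, I propose providing loaders only for the corpora that do not need a cleaning script.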

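Yes, I also agree with providing loaders only for the corpora that do not need a cleaning script. I think Korpora's functionality should be limited to (1) downloading corpora and (2) reading the data as-is. Anything beyond that scope is better split off into a separate project (e.g., metrics).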

lovit · 2020-10-05T04:51:02Z

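> The corpus structure is not what I had expected. Yes, I also think it would be best to develop this the same way as the existing corpora, with one file handling one corpus.

I have reflected this in the commit above.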

lovit · 2020-10-05T04:53:46Z

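> Yes, I also agree with providing loaders only for the corpora that do not need a cleaning script. I think Korpora's functionality should be limited to (1) downloading corpora and (2) reading the data as-is. Anything beyond that scope is better split off into a separate project (e.g., metrics).

I will take this part forward in ko-nlp/moducorpus-sanitizer.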

lovit · 2020-10-14T19:47:34Z

Manual loading test code

pip install colored

import contextlib
import os
import sys

# Make the local Korpora checkout importable when running from a subdirectory.
sys.path.insert(0, '../')

from colored import fg, attr


from Korpora.korpus_modu_news import ModuNewsKorpus
from Korpora.korpus_modu_messenger import ModuMessengerKorpus
from Korpora.korpus_modu_morpheme import ModuMorphemeKorpus
from Korpora.korpus_modu_ne import ModuNEKorpus
from Korpora.korpus_modu_spoken import ModuSpokenKorpus
from Korpora.korpus_modu_web import ModuWebKorpus
from Korpora.korpus_modu_written import ModuWrittenKorpus


## SET ARGUMENT ##
CUSTOM_DIR = ''


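@contextlib.contextmanager
def nostdout():
    """Temporarily silence stdout (e.g. loader progress messages)."""
    save_stdout = sys.stdout
    with open(os.devnull, "w") as devnull:
        sys.stdout = devnull
        try:
            yield
        finally:
            # Restore stdout even if the loader raises.
            sys.stdout = save_stdout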



corpora_name = [
    (ModuNewsKorpus, 'NIKL_NEWSPAPER(v1.0)'),
    (ModuMessengerKorpus, 'NIKL_MESSENGER(v1.0)'),
    (ModuMorphemeKorpus, 'NIKL_MP(v1.0)'),
    (ModuNEKorpus, 'NIKL_NE(v1.0)'),
    (ModuSpokenKorpus, 'NIKL_SPOKEN(v1.0)'),
    (ModuWebKorpus, 'NIKL_WEB(v1.0)'),
    (ModuWrittenKorpus, 'NIKL_WRITTEN(v1.0)')
]


for corpus, dirname in corpora_name:
    # corpus is the class itself, so read its name directly.
    classname = corpus.__name__
    with nostdout():
        corpus()
    print(f'{fg(2)} passed {classname} with default dir {attr(0)}')
    with nostdout():
        corpus(f'{CUSTOM_DIR}/{dirname}/')
    print(f'{fg(2)} passed {classname} with custom dir {attr(0)}')
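
As an alternative to swapping `sys.stdout` by hand, the standard library's `contextlib.redirect_stdout` can do the silencing and the restoring. A minimal sketch follows; `silenced_stdout` is a hypothetical name, not something the test script above defines.

```python
import contextlib
import os


@contextlib.contextmanager
def silenced_stdout():
    """Discard everything written to sys.stdout inside the with-block."""
    with open(os.devnull, "w") as devnull:
        # redirect_stdout restores the previous sys.stdout on exit,
        # including when the body raises an exception.
        with contextlib.redirect_stdout(devnull):
            yield
```

It can be used exactly like `nostdout()` in the loop above, for example `with silenced_stdout(): corpus()`.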

lovit added a commit that referenced this issue Oct 4, 2020

Implement News corpus (#103)

362d984

lovit added a commit that referenced this issue Oct 5, 2020

Add document_id to row index mapper (#103)

f1d4ad0

lovit added a commit that referenced this issue Oct 5, 2020

Rename: A modu corpus, a file (#103)

8d9011d

lovit added the On Progress label Oct 5, 2020

lovit mentioned this issue Oct 5, 2020

Modu Corpus: newspaper corpus loader #107

Closed

lovit added a commit that referenced this issue Oct 6, 2020

Implement News corpus (#103)

0bcaffe

lovit added a commit that referenced this issue Oct 6, 2020

Add document_id to row index mapper (#103)

608118b

lovit added a commit that referenced this issue Oct 6, 2020

Rename: A modu corpus, a file (#103, #107)

9c81359

lovit added a commit that referenced this issue Oct 6, 2020

First file index is 1, not 0 (#103, #107)

4df3015

lovit added a commit that referenced this issue Oct 6, 2020

Change ModuNewsData attributes (#103, #107)

9f82b4a

lovit mentioned this issue Oct 6, 2020

Modu Corpus: newspaper corpus loader #110

Merged

3 tasks

lovit added a commit that referenced this issue Oct 6, 2020

Update usage (#103, #107)

f0acef1

lovit added a commit that referenced this issue Oct 6, 2020

Update description of attributes (#103, #107)

d114e40

lovit added a commit that referenced this issue Oct 8, 2020

Implement Messenger Corpus (#103, #111)

3f60a40

lovit added a commit that referenced this issue Oct 8, 2020

Update usage (#103, #111)

7e0fbd1

lovit mentioned this issue Oct 8, 2020

Modu Corpus: messenger corpus loader #112

Merged

3 tasks

lovit added a commit that referenced this issue Oct 10, 2020

Raise exception when corpus file is not found (#103, #107)

a037755

lovit added a commit that referenced this issue Oct 10, 2020

Raise exception when corpus file is not found (#103, #111)

7cc277b

lovit added a commit that referenced this issue Oct 10, 2020

Fix typo in usage (#103, #111)

fc434e5

lovit added a commit that referenced this issue Oct 10, 2020

Separate corpus path finding functions (#103, #107)

0807311

lovit added a commit that referenced this issue Oct 10, 2020

Change tqdm unit: lines in a file -> files (#103, #107)

6ede350

lovit added a commit that referenced this issue Oct 10, 2020

Separate corpus path finding functions (#103, #111)

e07bbd9

lovit added a commit that referenced this issue Oct 10, 2020

Implement ModuWebKorpus (#103, #113)

8ef8367

lovit added a commit that referenced this issue Oct 10, 2020

Update web corpus usage (#103, #113)

953497a

lovit added a commit that referenced this issue Oct 10, 2020

Unify variable name paths_or_dir (#103, #113)

a7d7ff4

lovit mentioned this issue Oct 10, 2020

Dev modu web#113 #117

Merged

3 tasks

lovit added a commit that referenced this issue Oct 10, 2020

Implement written corpus (#103, #114)

b53059c

lovit added a commit that referenced this issue Oct 10, 2020

Update written corpus usage (#103, #114)

1b96877

lovit added a commit that referenced this issue Oct 10, 2020

Update messenger corpus usage (#103, #111)

1cf228e

lovit added a commit that referenced this issue Oct 10, 2020

Implement spoken corpus loader (#103, #115)

8e71872

lovit added a commit that referenced this issue Oct 10, 2020

Update spoken corpus usage (#103, #115)

8136a68

This was referenced Oct 10, 2020

Modu Corpus: written corpus loader #119

Merged

Modu Corpus: spoken corpus loader #120

Merged

lovit added a commit that referenced this issue Oct 10, 2020

Update corpus size (#103)

fb08d0a

lovit added a commit that referenced this issue Oct 10, 2020

Implement named entity corpus loader (#103, #116)

74ad6cc

lovit added a commit that referenced this issue Oct 10, 2020

Update named entity corpus usage (#103, #116)

a00f21d

lovit mentioned this issue Oct 10, 2020

Modu Corpus: named entity analysis corpus loader #125

Merged

3 tasks

lovit added a commit that referenced this issue Oct 10, 2020

Change NamedEntityExample.__str__ (#103, #116)

8d910f3

lovit added a commit that referenced this issue Oct 10, 2020

Change NamedEntityExample.__str__ (#103, #116)

acc5cb0

lovit added a commit that referenced this issue Oct 10, 2020

Implement morpheme corups loader (#103, #122)

6fd3f0c

lovit added a commit that referenced this issue Oct 10, 2020

Skip empty tagged sentence in spoken corpus (#103, #122)

d26e74a

lovit added a commit that referenced this issue Oct 10, 2020

Update morpheme corpus usage (#103, #122)

717ee93

lovit added a commit that referenced this issue Oct 10, 2020

Update morpheme tagset (#103, #122)

f5d0141

lovit mentioned this issue Oct 10, 2020

Modu Corpus: morphological analysis corpus loader #126

Merged

3 tasks

lovit mentioned this issue Oct 14, 2020

Add Modu Corpus loader classes to Korpora #132

Merged

3 tasks
