Modu Corpus (모두의 말뭉치): newspaper corpus loader #110
Merged
17 commits
362d984 Implement News corpus (#103) (lovit)
f1d4ad0 Add document_id to row index mapper (#103) (lovit)
c03b6b3 Print only `KorpusData` class instances in `Korpus.__str__` (lovit)
8d9011d Rename: A modu corpus, a file (#103) (lovit)
0bcaffe Implement News corpus (#103) (lovit)
608118b Add document_id to row index mapper (#103) (lovit)
78d5082 Print only `KorpusData` class instances in `Korpus.__str__` (lovit)
9c81359 Rename: A modu corpus, a file (#103, #107) (lovit)
4df3015 First file index is 1, not 0 (#103, #107) (lovit)
9f82b4a Change ModuNewsData attributes (#103, #107) (lovit)
4a8415d Merge branch 'dev-modu#103' of https://github.com/ko-nlp/Korpora into… (lovit)
585d400 No print KorpusData.name in KorpusData.__str__ (lovit)
f0acef1 Update usage (#103, #107) (lovit)
d114e40 Update description of attributes (#103, #107) (lovit)
a037755 Raise exception when corpus file is not found (#103, #107) (lovit)
0807311 Separate corpus path finding functions (#103, #107) (lovit)
6ede350 Change tqdm unit: lines in a file -> files (#103, #107) (lovit)
@@ -0,0 +1,159 @@
import json
import os
import re
from dataclasses import dataclass
from glob import glob
from tqdm import tqdm
from typing import List
from Korpora.korpora import Korpus, KorpusData


description = """     모두의 말뭉치는 문화체육관광부 산하 국립국어원에서 제공하는 말뭉치로
총 13 개의 말뭉치로 이뤄져 있습니다.

해당 말뭉치를 이용하기 위해서는 국립국어원 홈페이지에 가셔서 "회원가입 > 말뭉치 신청 > 승인"의
과정을 거치셔야 합니다.

https://corpus.korean.go.kr/#none

모두의 말뭉치는 승인 후 다운로드 가능 기간 및 횟수 (3회) 에 제한이 있습니다.

로그인 기능 및 Korpora 패키지에서의 다운로드 기능을 제공하려 하였지만,
국립국어원에서 위의 이유로 이에 대한 기능은 제공이 불가함을 확인하였습니다.

Korpora==0.2.0 에서는 "개별 말뭉치 신청 > 승인"이 완료되었다고 가정,
로컬에 다운로드 된 말뭉치를 손쉽게 로딩하는 기능만 제공할 예정입니다

(Korpora 개발진 lovit@github, ratsgo@github)"""

license = """    모두의 말뭉치의 모든 저작권은 `문화체육관광부 국립국어원
(National Institute of Korean Language)` 에 귀속됩니다.
정확한 라이센스는 확인 중 입니다."""


class ModuNewsKorpus(Korpus):
    def __init__(self, root_dir_or_paths, load_light=True, force_download=False):
        super().__init__(description, license)
        paths = find_corpus_paths(root_dir_or_paths)
        if load_light:
            self.train = ModuNewsDataLight('모두의_뉴스_말뭉치(light).train', load_modu_news(paths, load_light))
        else:
            self.train = ModuNewsData('모두의_뉴스_말뭉치.train', load_modu_news(paths, load_light))
        self.row_to_documentid = [news.document_id for news in self.train]
        self.documentid_to_row = {document_id: idx for idx, document_id in enumerate(self.row_to_documentid)}


class ModuNewsData(KorpusData):
    def __init__(self, name, news):
        super().__init__(name, news)
        self.document_ids = [doc.document_id for doc in news]
        self.titles = [doc.title for doc in news]
        self.authors = [doc.author for doc in news]
        self.publishers = [doc.publisher for doc in news]
        self.dates = [doc.date for doc in news]
        self.topics = [doc.topic for doc in news]
        self.original_topics = [doc.original_topic for doc in news]
        self.texts = [doc.paragraph for doc in news]

    def __getitem__(self, index):
        news = ModuNews(
            self.document_ids[index],
            self.titles[index],
            self.authors[index],
            self.publishers[index],
            self.dates[index],
            self.topics[index],
            self.original_topics[index],
            self.texts[index].split('\n'))
        return news


class ModuNewsDataLight(KorpusData):
    def __init__(self, name, news):
        super().__init__(name, news)
        self.texts = [doc.paragraph for doc in news]
        self.titles = [doc.title for doc in news]
        self.document_ids = [doc.document_id for doc in news]

    def __getitem__(self, index):
        news = ModuNewsLight(
            self.document_ids[index],
            self.titles[index],
            self.texts[index])
        return news


@dataclass
class ModuNews:
    document_id: str
    title: str
    author: str
    publisher: str
    date: str
    topic: str
    original_topic: str
    paragraph: List[str]


@dataclass
class ModuNewsLight:
    document_id: str
    title: str
    paragraph: str


def document_to_a_news(document):
    document_id = document['id']
    meta = document['metadata']
    title = meta['title']
    author = meta['author']
    publisher = meta['publisher']
    date = meta['date']
    topic = meta['topic']
    original_topic = meta['original_topic']
    paragraph = '\n'.join([p['form'] for p in document['paragraph']])
    return ModuNews(document_id, title, author, publisher, date, topic, original_topic, paragraph)


def document_to_a_news_light(document):
    document_id = document['id']
    meta = document['metadata']
    title = meta['title']
    paragraph = '\n'.join([p['form'] for p in document['paragraph']])
    return ModuNewsLight(document_id, title, paragraph)


def find_corpus_paths(root_dir_or_paths):
    prefix_pattern = re.compile('N[WLPIZ]RW')
    def match(path):
        prefix = path.split(os.path.sep)[-1][:4]
        return prefix_pattern.match(prefix)

    # directory + wildcard
    if isinstance(root_dir_or_paths, str):
        paths = sorted(glob(f'{root_dir_or_paths}/*.json') + glob(root_dir_or_paths))
    else:
        paths = root_dir_or_paths

    paths = [path for path in paths if match(path)]
    if not paths:
        raise ValueError('Not found corpus files. Check `root_dir_or_paths`')
    return paths


def load_modu_news(paths, load_light):
    transform = document_to_a_news_light if load_light else document_to_a_news
    news = []
    for i_path, path in enumerate(tqdm(paths, desc='Loading ModuNews', total=len(paths))):
        with open(path, encoding='utf-8') as f:
            data = json.load(f)
        documents = data['document']
        news += [transform(document) for document in documents]
    return news


def fetch_modu():
    raise NotImplementedError(
        "국립국어원에서 API 기능을 제공해 줄 수 없음을 확인하였습니다."
        "\n이에 따라 모두의 말뭉치는 fetch 기능을 제공하지 않습니다"
    )
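To illustrate how the file-name filter in `find_corpus_paths` decides which files count as news corpus files, here is a minimal standalone sketch. It reuses the same `N[WLPIZ]RW` prefix pattern from the diff. The helper name `is_news_corpus_file` and the example file names are hypothetical and only serve the illustration, and it uses `os.path.basename` instead of splitting on `os.path.sep`.

```python
import os
import re

# Same prefix pattern as in the diff: N, one of W/L/P/I/Z, then RW.
PREFIX_PATTERN = re.compile('N[WLPIZ]RW')


def is_news_corpus_file(path: str) -> bool:
    """Hypothetical helper: True when the file name starts with a news-corpus prefix."""
    file_name = os.path.basename(path)
    return PREFIX_PATTERN.match(file_name[:4]) is not None


# Hypothetical file names, used only to exercise the filter.
candidates = [
    os.path.join('data', 'NWRW1800000001.json'),
    os.path.join('data', 'NIRW1900000001.json'),
    os.path.join('data', 'SXRW1800000001.json'),
    os.path.join('data', 'README.md'),
]
selected = [path for path in candidates if is_news_corpus_file(path)]
print(selected)
```

Only the first two candidates pass the filter, so a directory that holds no files with a matching prefix produces an empty list, which is the case where `find_corpus_paths` raises `ValueError`.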
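I'm testing this right now. When a user passes a wrong `root_dir_or_paths` to ModuNewsKorpus and no data is loaded at all (that is, the length of `self.texts` is 0), I confirmed that `__getitem__` raises an error. In the same situation, inspecting `news_corpus.train` also raises an error.

So I suggest keeping the existing logic that reads every file under `root_dir_or_paths`, but adding defensive logic for the case where nothing is loaded because the path is misconfigured or for a similar reason.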
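One possible shape for such a guard is sketched below, under stated assumptions. `load_documents_or_fail` is a hypothetical helper, not part of Korpora. It assumes each corpus file is a JSON object with a `document` list, as in `load_modu_news`, and it fails loudly both when no paths are given and when the files contain no documents.

```python
import json
from typing import Iterable, List


def load_documents_or_fail(paths: Iterable[str]) -> List[dict]:
    """Hypothetical helper: collect `document` entries, raising if nothing was loaded."""
    paths = list(paths)
    if not paths:
        # Nothing to read: most likely a wrong directory or pattern.
        raise ValueError('No corpus files were given. Check `root_dir_or_paths`')
    documents = []
    for path in paths:
        with open(path, encoding='utf-8') as f:
            data = json.load(f)
        # Assumes the top-level `document` key used in the diff.
        documents += data.get('document', [])
    if not documents:
        # Files exist but hold no documents, so indexing would fail later.
        raise ValueError(f'No documents were loaded from {len(paths)} file(s)')
    return documents
```

Raising at load time keeps the failure close to its cause, so it does not surface later as an indexing error in `__getitem__` or when printing `news_corpus.train`.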