namuwikitext parsing error #202

Open
jeongukjae opened this issue Apr 25, 2021 · 0 comments
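Currently, namuwikitext documents are split on \n =, but I found that the body text contains lines that start with \n = yet do not end with =. Judging from the docstring of Korpora.utils::load_wikitext, splitting on headings appears to be the intended behavior, so I am filing this issue.

You can find these lines by searching with the regex ^ =.*[^=]$ and I have written up the details in jeongukjae/tfds-korean#12 (comment).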
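The search described above can be reproduced with a short script. This is only a minimal sketch: the helper name find_suspect_lines and the sample text are hypothetical and not taken from the real corpus. The pattern is applied line by line so that [^=] can never match a newline character.

```python
import re

# Pattern quoted in the issue: the line starts with " =" but does not end with "=".
SUSPECT_LINE = re.compile(r'^ =.*[^=]$')


def find_suspect_lines(text):
    """Return every line that starts with ' =' but does not end with '='."""
    return [line for line in text.splitlines() if SUSPECT_LINE.match(line)]


# Hypothetical sample; the real namuwikitext files are much larger.
sample = (
    " = Title =\n"
    "text ...\n"
    " =not a heading\n"
    " = = Section = =\n"
    "text ...\n"
)
suspects = find_suspect_lines(sample)
```

In this sample only the third line is reported, because the two real headings end with =.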

Korpora/Korpora/utils.py

Lines 64 to 91 in a2c1ba8

def load_wikitext(path, num_lines=-1):
    """
    Wikitext format
     = Head1 =
    text ...
    text ...
     = = 2ead = =
    text ...
    text ...
    """
    if num_lines <= 0:
        with open(path, encoding='utf-8') as f:
            texts = f.read().split('\n =')
    else:
        lines = []
        with open(path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                if (i >= num_lines):
                    break
                lines.append(line)
        texts = ''.join(lines).split('\n =')
    # fix missing prefix
    texts = [texts[0]] + [f' ={text}' for text in texts[1:]]
    return texts
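
One possible way to split only on real headings is sketched below. This is an illustration under assumptions, not the maintainers' fix: the function name split_on_headings is hypothetical, and it assumes a heading is any line that starts with " =" and also ends with "=" (optionally followed by spaces or tabs). Body lines that happen to satisfy both conditions would still be treated as headings.

```python
import re

# Split at a newline only when the following line both starts with " ="
# and ends with "=". The lookahead keeps the heading in the next chunk,
# so no prefix has to be re-attached afterwards.
HEADING_BOUNDARY = re.compile(r'\n(?= =[^\n]*=[ \t]*$)', re.MULTILINE)


def split_on_headings(text):
    """Split wikitext-style text into chunks that each begin at a heading."""
    return HEADING_BOUNDARY.split(text)


# Hypothetical sample containing one body line that starts with " =".
sample_text = (
    " = Title =\n"
    "text ...\n"
    " =not a heading\n"
    "more text\n"
    " = = Section = =\n"
    "text ...\n"
)
chunks = split_on_headings(sample_text)
naive_chunks = sample_text.split('\n =')
```

Here chunks has two entries (one per heading), while the plain split on '\n =' yields three pieces because it also cuts at the body line.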

Related issues: lovit/namuwikitext#10, jeongukjae/tfds-korean#12
