Fix Japanese handling (#107)
* Fix Japanese handling

This changes the Japanese tokenizer to use mecab-python3 version 1.0 or
greater, which means the package will work on Windows. (#104)

However, since the Japanese tokenizer pulls in heavy dependencies and
isn't necessary unless you're dealing with Japanese, I moved it to
optional dependencies. You can install sacrebleu with Japanese support
like below:

    pip install sacrebleu[ja]

That will install mecab-python3 and ipadic.

This also includes basic tests to check that the tokenization is as
expected for IPAdic.

* Make travis install the Japanese deps

* Remove old comments
polm authored Jul 29, 2020
1 parent c36e558 commit 6e663d6
Showing 3 changed files with 20 additions and 9 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -7,7 +7,7 @@ before_install:

 install:
   - pip install pytest-cov
-  - pip install .
+  - pip install .[ja]

 language: python
 python:
23 changes: 17 additions & 6 deletions sacrebleu/tokenizers/tokenizer_ja_mecab.py
@@ -1,20 +1,31 @@
 # -*- coding: utf-8 -*-

-import MeCab
+try:
+    import MeCab
+    import ipadic
+except ModuleNotFoundError:
+    # Don't fail until the tokenizer is actually used
+    MeCab = None

 from .tokenizer_none import NoneTokenizer

+FAIL_MESSAGE = """
+Japanese tokenization requires extra dependencies, but you do not have them installed.
+Please install them like so.
+pip install sacrebleu[ja]
+"""

 class TokenizerJaMecab(NoneTokenizer):
     def __init__(self):
-        self.tagger = MeCab.Tagger("-Owakati")
+        if MeCab is None:
+            raise RuntimeError(FAIL_MESSAGE)
+        self.tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")

         # make sure the dictionary is IPA
-        # sacreBLEU is only compatible with 0.996.5 for now
-        # Please see: https://github.com/mjpost/sacrebleu/issues/94
         d = self.tagger.dictionary_info()
         assert d.size == 392126, \
-            "Please make sure to use IPA dictionary for MeCab"
+            "Please make sure to use the IPA dictionary for MeCab"
+        # This asserts that no user dictionary has been loaded
         assert d.next is None

     def __call__(self, line):
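
The deferred-failure pattern used in the tokenizer change above can be sketched in isolation. This is a minimal illustration, not code from sacrebleu: the module name, the `OptionalFeature` class, the `feature_available` helper, and the message text are all hypothetical, and the module is assumed not to be installed so that the fallback path runs.

```python
import importlib

# Hypothetical module name used only for illustration; it stands in for
# an optional dependency such as MeCab and is assumed not to be installed.
OPTIONAL_MODULE = "some_optional_backend_that_is_not_installed"

try:
    backend = importlib.import_module(OPTIONAL_MODULE)
except ModuleNotFoundError:
    # Don't fail at import time; just remember that the dependency is missing.
    backend = None

# Hypothetical message; a real package would name its own extra here.
FAIL_MESSAGE = "This feature needs extra dependencies: pip install yourpackage[extra]"


class OptionalFeature:
    """Hypothetical class that needs the optional backend."""

    def __init__(self):
        if backend is None:
            # Fail only when the feature is actually requested.
            raise RuntimeError(FAIL_MESSAGE)
        self.backend = backend


def feature_available():
    """Return True if the optional backend was imported successfully."""
    return backend is not None
```

Importing the module that contains this code succeeds even without the optional dependency. Only constructing `OptionalFeature` raises, so users who never touch the feature are unaffected.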
4 changes: 2 additions & 2 deletions setup.py
@@ -129,13 +129,13 @@ def get_description():
     # your project is installed. For an analysis of "install_requires" vs pip's
     # requirements files see:
     # https://packaging.python.org/en/latest/requirements.html
-    install_requires = ['typing;python_version<"3.5"', 'portalocker', 'mecab-python3==0.996.5'],
+    install_requires = ['typing;python_version<"3.5"', 'portalocker'],

     # List additional groups of dependencies here (e.g. development
     # dependencies). You can install these using the following syntax,
     # for example:
     # $ pip install -e .[dev,test]
-    extras_require = {},
+    extras_require = {'ja': ['mecab-python3>=1.0,<2.0', 'ipadic>=1.0,<2.0'] },

     # To provide executable scripts, use entry points in preference to the
     # "scripts" keyword. Entry points provide cross-platform support and allow
