
Japanese Model #3756

Closed

polm opened this issue May 17, 2019 · 182 comments
Labels
enhancement Feature requests and improvements lang / ja Japanese language data and models models Issues related to the statistical models

Comments

@polm (Contributor) commented May 17, 2019

Feature description

I'd like to add a Japanese model to spaCy. (Let me know if this should be discussed in #3056 instead - I thought it best to just tag it in for now.)

The Ginza project exists, but currently it's a repackaging of spaCy rather than a model to use with normal spaCy, and I think some of the resources it uses may be tricky to integrate from a licensing perspective.

My understanding is that the main parts of a model now are 1. the dependency model, 2. NER, and 3. word vectors. Notes on each of those:

  1. Dependencies. For dependency info we can use UD Japanese GSD. UD BCCWJ is bigger but the corpus has licensing issues. GSD is rather small but probably enough to be usable (8k sentences). I have trained it with spaCy and there were no conversion issues.

  2. NER. I don't know of a good dataset for this; Christopher Manning mentioned the same problem two years ago. I guess I could make one based on Wikipedia - I think some other spaCy models use data produced by Nothman et al's method, which skipped Japanese to avoid dealing with segmentation, so that might be one approach. (A reasonable question here is: what do people use for NER in Japanese? Most tokenizer dictionaries, including Unidic, have entity-like information and make it easy to add your own entries, so that's probably the most common approach.)

  3. Vectors. Using JA Wikipedia is no problem. I haven't worked with the Common Crawl before and I'm not sure I have the hardware for it, but if I could get some help on it that's also an option.

So, how does that sound? If there are no issues with that, I'll look into creating an NER dataset.

@hiroshi-matsuda-rit (Contributor) commented:

Hi! I feel the same way. I think there is a lot of demand for integrating a Japanese model into spaCy. I'm actually preparing to integrate GiNZA with spaCy v2.1 and plan to create some pull requests for this integration, with a parser model trained on the (formally licensed) UD-Japanese BCCWJ and an NER model trained on the Kyoto University Web Document Leads Corpus (KWDLC), by the middle of June.

Unfortunately, the UD-Japanese GSD corpus is effectively obsolete after UD v2.4 and will no longer be maintained by the UD community. They are focusing on developing the BCCWJ version from v2.3 onwards. Our lab (Megagon Labs) has a mission to develop and publish models trained on UD-Japanese BCCWJ together with the licensor (NINJAL).

Because of licensing problems (mostly from the newspaper sources) and differences among the word segmentation policies tied to POS systems, it is difficult to train and publish commercially usable NER models. But there is a good solution, which I've already used for GiNZA: the Kyoto University Web Document Leads Corpus has high-quality gold-standard annotations containing NER spans with the usual NER categories. I used the part of KWDLC's NER annotations that matches the token boundaries of the Sudachi morphological analyzer (mode C) to train our models.
http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KWDLC

What do you think of this approach? @polm

@honnibal (Member) commented Jun 1, 2019

If you're able to prepare the dataset files, I'd definitely be glad to add a Japanese model to the training pipeline.

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

@honnibal honnibal added the enhancement Feature requests and improvements label Jun 1, 2019
@polm (Contributor, Author) commented Jun 1, 2019

@hiroshi-matsuda-rit I think that sounds great! Let me know if there's anything I can help with.

Unfortunately, the UD-Japanese GSD corpus is almost obsoleted after UD v2.4 and will not be maintained by the UD community anymore.

I had not realized this, that's really unfortunate.

The Kyoto University Web Document Leads Corpus has high quality gold-standard annotations containing NER spans with popular NER categories. I used a part of KWDLC's NER annotations which meet to the boundaries of Sudachi morphological analyzer (mode C) to train our models.

I was not aware this had good NER annotations, so that sounds like a great option. My only concern is the license is somewhat unusual so I'm not sure how the spaCy team or downstream users would feel about using a model based on the data.

The thing I thought might be a problem with BCCWJ and similar data is that based on #3056 my understanding is that the spaCy maintainers want to have access to the training data, not just compiled models, so that the models can be added to the training pipeline and updated whenever spaCy's code is updated (see here). @honnibal Are you open to adding data to the training pipeline that requires a paid license, like the BCCWJ? I didn't see a mention of any cases like that in #3056 so I wasn't sure...

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

That makes sense, though it seems easier to use the Crawl for word vectors. Is there a good non-manual way to make NER training data from the Common Crawl? I thought of starting with Wikipedia since the approach in Nothman et al (which I think was/is used for some spaCy training data?) seemed relatively quick to implement; I've posted a little progress on that here. I guess with the Common Crawl a similar strategy could be used, looking for links to Wikipedia or something? With a quick search all I found was DepCC, which seems to just use the output of the Stanford NER models for NER data.

@honnibal (Member) commented Jun 1, 2019

Is there a good non-manual way to make NER training data from the Common Crawl?

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

Are you open to adding data to the training pipeline that requires a paid license, like the BCCWJ?

We're open to buying licenses to corpora, sure, so long as once we buy the license we can train and distribute models for free. We've actually paid quite a lot in license fees for the English and German models. If the license would require users to buy the corpus, it's still a maybe, but I'm less enthusiastic.

@ines ines added lang / ja Japanese language data and models models Issues related to the statistical models labels Jun 1, 2019
@polm (Contributor, Author) commented Jun 1, 2019

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

If 20 hours is enough then I'd be glad to give it a shot! I figured it'd take much longer than that.

@hiroshi-matsuda-rit (Contributor) commented Jun 1, 2019

@honnibal

If you're able to prepare the dataset files, I'd definitely be glad to add a Japanese model to the training pipeline.

Sure. I think it would be better if the training datasets were published under an OSS license, even if it's only a small part of BCCWJ. I'm going to raise this dataset-publishing question with the licensor. Please wait just a few days.

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

If we can use Common Crawl texts without worrying about license problems, I think KWDLC would be a good option for an NER training corpus, because KWDLC's annotation data has no restrictions on commercial use and its text is all retrieved from the web in much the same way as Common Crawl.

@polm

I think that sounds great! Let me know if there's anything I can help with.

I'm very happy to hear that and would appreciate your help. Thanks a lot! Please review my branches later.

I had not realized this, that's really unfortunate.

Kanayama-san, one of the founding members of the UD-Japanese community, updated the UD-Japanese GSD master branch a few weeks ago.
I'd like to ask him about the future of GSD at the UD-Japanese meeting scheduled for June 17 in Kyoto.

My only concern is the license is somewhat unusual so I'm not sure how the spaCy team or downstream users would feel about using a model based on the data.

I think we should use datasets with a well-known license such as MIT or GPL to train the language models that spaCy supports officially.
To expand the possibilities, I'd like to discuss re-publishing the KWDLC corpus under the MIT license with our legal department.

Thanks!

@hiroshi-matsuda-rit (Contributor) commented:

@polm @honnibal Good news for us!
I discussed the Japanese UD open-dataset question with Asahara-san, the leader of the UD-Japanese community, this morning.
They are planning to keep maintaining the GSD open dataset (applying the same refinement methods to both BCCWJ and GSD) and will decide on this at the UD-Japanese meeting on June 17.
If this plan is approved by the committee, I think UD-Japanese GSD will be a strong choice as the training dataset for spaCy's standard model.

I'd like to prepare a new spaCy model trained on UD-Japanese GSD (parser) and KWDLC (NER) in a few days. I'll reuse GiNZA's logic, but it will be much cleaner than the previous version of GiNZA; I'm refactoring the pipeline (I dropped the dummy root token).

@hiroshi-matsuda-rit (Contributor) commented Jun 6, 2019

@polm @honnibal
I've just uploaded a trial version of ja_ginza_gsd to the GiNZA GitHub repository.

https://github.com/megagonlabs/ginza/releases/tag/v1.1.0-gsd_preview-1

I used the latest UD-Japanese GSD dataset and KWDLC to train that model.

Parsing accuracy (using separated test dataset):

sentence=550, gold_token=12371, result_token=12471
sentence:LAS=0.1673,UAS=0.2418,POS=0.2909,boundary=0.6182,root=0.9473
tkn_recall:LAS=0.8413,UAS=0.8705,POS=0.9270,boundary=0.9669
tkn_precision:LAS=0.8346,UAS=0.8635,POS=0.9196,boundary=0.9591

NER accuracy (using the same dataset for both train and test, sorry):

labels: <ALL>, sentence=14742, gold_ent=7377, result_ent=7216
 overlap
  recall: 0.9191 (label=0.8501), precision: 0.9426 (label=0.8713)
 include
  recall: 0.9183 (label=0.8499), precision: 0.9417 (label=0.8710)
 one side border
  recall: 0.9176 (label=0.8493), precision: 0.9407 (label=0.8701)
 both borders
  recall: 0.8650 (label=0.8090), precision: 0.8843 (label=0.8271)

label confusion matrix
       |  DATE |  LOC  | MONEY |  ORG  |PERCENT| PERSON|PRODUCT|  TIME | {NONE}
  DATE |   1435|      0|      0|      1|      0|      0|      2|      1|     71
  LOC  |      1|   2378|      0|     65|      0|      9|     13|      0|    139
 MONEY |      2|      0|    109|      0|      1|      0|      0|      0|      1
  ORG  |      1|     83|      1|    826|      0|     22|     91|      0|    120
PERCENT|      6|      0|      0|      0|     82|      0|      0|      0|      8
 PERSON|      1|      4|      0|     26|      0|    852|      9|      0|     40
PRODUCT|      4|     27|      0|    110|      0|     18|    555|      0|    193
  TIME |     17|      0|      0|      0|      0|      0|      0|     50|     25
 {NONE}|    121|     87|      3|     31|      7|     46|     99|     20|      0

I'm still refactoring the source code to resolve the two important issues below:

  1. Use the entry-point mechanism to register GiNZA as a custom language class (a minimal sketch follows at the end of this comment)
    https://spacy.io/usage/saving-loading#entry-points-components

  2. Find a way to disambiguate the POS of the root token, mostly NOUN or VERB, using the parse result
    I use a kind of dependency-label expansion technique both to merge over-segmented tokens and to correct POS errors.
    https://github.com/megagonlabs/ginza/blob/v1.1.0-gsd_preview-1/ja_ginza/parse_tree.py#L440
    I read the note below, which might be related to this issue, but I could not find the APIs to do it.

Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
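
For reference, a minimal sketch of the entry-point wiring described in the linked docs. The package layout and factory name below are illustrative (they are not GiNZA's actual setup.py); spaCy v2.1+ scans the spacy_factories group and makes registered factories available to the pipeline machinery.

# setup.py (illustrative sketch, hypothetical names)
from setuptools import setup

setup(
    name="ginza",
    version="1.1.0",
    packages=["ginza"],
    entry_points={
        # each entry maps a factory name to "module:callable"
        "spacy_factories": [
            "JapaneseCorrector = ginza:create_japanese_corrector",
        ],
    },
)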

@hiroshi-matsuda-rit (Contributor) commented Jun 7, 2019

  1. Use the entry-point mechanism to register GiNZA as a custom language class

I've finished the refactoring to use entry points.
GiNZA's source code and its package structure are now much cleaner than before.
https://github.com/megagonlabs/ginza/releases/tag/v1.1.1-gsd_preview-2

I'll add one more improvement to the accuracy of the root token's POS tonight.

@hiroshi-matsuda-rit (Contributor) commented:

  2. Find a way to disambiguate the POS of the root token, mostly NOUN or VERB, using the parse result

I've implemented rule-based logic for "sa-hen" noun POS disambiguation and released the final alpha version.

https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3

I'm going to add some documentation and tests, and then create a PR to integrate the new ja language class and models into the spaCy master branch.

Please review the code above and give me some feedback.
Thanks a lot! @polm @honnibal

@KoichiYasuoka (Contributor) commented:

Thank you, @hiroshi-matsuda-rit san, for your work on

https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3

but I could not find ja_gsd-1.1.2.tar.gz in the Assets. Umm...

@hiroshi-matsuda-rit (Contributor) commented:

I'm so sorry for that. I added it. Please try downloading again. @KoichiYasuoka

@KoichiYasuoka (Contributor) commented Jun 8, 2019

Thank you again, @hiroshi-matsuda-rit san. I've checked some attributes of Token (https://spacy.io/api/token) as follows:

for t,bi,typ in zip(doc,doc._.bunsetu_bi_label,doc._.bunsetu_position_type):
  print(t.i,t.orth_,t.lemma_,t.pos_,t.tag_,"_",t.head.i,t.dep_,"_",bi,typ)
0 旅 旅 VERB VERB _ 2 acl _ B SEM_HEAD
1 する 為る AUX AUX _ 0 aux _ I SYN_HEAD
2 時 時 NOUN NOUN _ 6 nsubj _ B SEM_HEAD
3 は は ADP 助詞,係助詞,*,* _ 2 case _ I SYN_HEAD
4 旅 旅 NOUN NOUN _ 6 obj _ B SEM_HEAD
5 を を ADP 助詞,格助詞,*,* _ 4 case _ I SYN_HEAD
6 する 為る AUX AUX _ 6 ROOT _ B ROOT

and I found that several t.tag_ values were clobbered by t.pos_. Yes, I can use t._.pos_detail instead of t.tag_ in the Japanese model only, but that becomes rather complicated when I use it together with other language models.

@hiroshi-matsuda-rit (Contributor) commented:

I previously tried to store detailed part-of-speech information in tag_, but I could not do so without modifying spaCy's token data structure, and unfortunately that modification requires C-level recompilation and reduces compatibility with other languages.

I'd like to read spaCy's source code again and report whether we can store a language-specific tag in token.tag_ or not. @KoichiYasuoka

@hiroshi-matsuda-rit (Contributor) commented:

I've found that recent spaCy versions can change token.pos_.
I'm still testing it, but I'd like to release the trial version below.
https://github.com/megagonlabs/ginza/releases/tag/v1.2.0-gsd_preview-4
Thanks, @KoichiYasuoka
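
For anyone following along, a minimal sketch of what "changing token.pos_" can look like in recent spaCy v2 releases: a small post-processing component that overwrites the coarse POS after parsing. The model name is the preview model from this thread, and the 名詞 tag check is an illustrative rule, not GiNZA's actual logic.

import spacy

def fix_root_pos(doc):
    # illustrative rule: force the root to NOUN when its fine-grained tag is a noun (名詞)
    for token in doc:
        if token.dep_ == "ROOT" and token.tag_.startswith("名詞") and token.pos_ != "NOUN":
            token.pos_ = "NOUN"
    return doc

nlp = spacy.load("ja_gsd")
nlp.add_pipe(fix_root_pos, last=True)
doc = nlp("旅をする時は旅をする")
print([(t.text, t.tag_, t.pos_) for t in doc])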

@KoichiYasuoka (Contributor) commented Jun 8, 2019

Thank you again and again, @hiroshi-matsuda-rit san. I'm now trying
https://github.com/megagonlabs/ginza/releases/tag/v1.2.0-gsd_preview-4
and t.tag_ works well, but 「名詞である」 now raises an error.

>>> import spacy
>>> ja=spacy.load("ja_gsd")
>>> s=ja("名詞である")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yasuoka/.local/lib/python3.7/site-packages/spacy/language.py", line 390, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/yasuoka/.local/lib/python3.7/site-packages/ginza/japanese_corrector.py", line 21, in __call__
    set_bunsetu_bi_type(doc)
  File "/home/yasuoka/.local/lib/python3.7/site-packages/ginza/japanese_corrector.py", line 83, in set_bunsetu_bi_type
    t.pos_ in FUNC_POS or
  File "token.pyx", line 864, in spacy.tokens.token.Token.pos_.__get__
KeyError: 405

The same happens with 「猫である」, 「人である」, etc., so 「である」 seems to get a wrong key for t.pos_.

>>> s=ja("である")
>>> for t in s:
...   dir(t)
...
['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']
>>> print(t.tag_)
接続詞,*,*,*+連体詞,*,*,*
>>> print(t.pos_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "token.pyx", line 864, in spacy.tokens.token.Token.pos_.__get__
KeyError: 405
>>> print(t.pos)
405

@KoichiYasuoka (Contributor) commented Jun 9, 2019

One more thing, @hiroshi-matsuda-rit san: what do you think about changing t._.bunsetu_bi_label to t._.chunk_iob? If you agree (by analogy with t.ent_iob_ for NER), how about also changing t._.bunsetu_position_type to t._.chunk_pos? "bunsetu" is rather long for me, and I often misspell it as "bunsetsu"...
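
If the rename goes ahead, a minimal sketch of how the shorter names could be offered as aliases via custom extensions (the chunk_* names are only the proposal above, and the aliasing component is hypothetical):

from spacy.tokens import Doc

Doc.set_extension("chunk_iob", default=None, force=True)
Doc.set_extension("chunk_pos", default=None, force=True)

def alias_bunsetu_labels(doc):
    # copy the values produced by the existing bunsetu_* extensions into the shorter names
    doc._.chunk_iob = doc._.bunsetu_bi_label
    doc._.chunk_pos = doc._.bunsetu_position_type
    return doc

# nlp.add_pipe(alias_bunsetu_labels, last=True)  # after the component that fills bunsetu_*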

@hiroshi-matsuda-rit (Contributor) commented:

@KoichiYasuoka Thank you!
I just opened an issue on GiNZA's repository:
megagonlabs/ginza#26
Let's continue that discussion there.

Actually, I've already improved GiNZA's code for the problems you reported above, and I've just started refactoring the training procedure.
I'll report the progress on that tomorrow.

@hiroshi-matsuda-rit (Contributor) commented Jun 11, 2019

@honnibal I'd like to move this discussion from #3818 to here, because it is a Japanese-model-specific issue.

I'm trying to use spaCy's train command with SudachiTokenizer to create a Japanese model, as I mentioned in #3818.
Unfortunately, due to a misunderstanding on my part, it doesn't work yet.
I used the UD-Japanese GSD dataset converted to JSON, and the train command works well when I add the -G option.
But if I drop the -G option:

python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory                                                                                                                                                         
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.

I added probe code to ArcEager.set_costs() like below, and found that some fragmented content seems to be created by the Levenshtein alignment procedure.

        if n_gold < 1:
            for t in zip(gold.words, gold.tags, gold.heads, gold.labels):
                print(t)

The printed content for the sentence "高橋の地元、盛岡にある「いわてアートサポートセンター」にある風のスタジオでは、地域文化芸術振興プランと題して「語りの芸術祭inいわて盛岡」と呼ばれる朗読会が過去に上演されてお り、高橋の作品も何度か上演されていたことがある。" is:

('高橋', 'NNP', 2, 'nmod')
('の', 'PN', 0, 'case')
('地元', 'NN', 4, 'nmod')
('、', 'SYM', 2, 'punct')
('盛岡', 'NNP', 6, 'iobj')
('に', 'PS', 4, 'case')
('ある', 'VV', None, 'advcl')
('「', 'SYM', None, 'punct')
(None, None, None, None)
('アート', 'NN', 10, 'compound')
('サポートセンター', 'NN', 13, 'iobj')
('」', 'SYM', 10, 'punct')
('に', 'PS', 10, 'case')
('ある', 'VV', 16, 'acl')
('風', 'NN', 16, 'nmod')
('の', 'PN', 14, 'case')
('スタジオ', 'NN', 44, 'obl')
('で', 'PS', 16, 'case')
('は', 'PK', 16, 'case')
('、', 'SYM', 16, 'punct')
('地域', 'NN', 24, 'compound')
('文化', 'NN', 24, 'compound')
('芸術', 'NN', 24, 'compound')
('振興', 'NN', 24, 'compound')
('プラン', 'NN', None, 'obl')
('と', 'PS', 24, 'case')
(None, None, None, None)
('て', 'PC', None, 'mark')
('「', 'SYM', 29, 'punct')
('語り', 'NN', 34, 'nmod')
('の', 'PN', 29, 'case')
('芸術祭', 'NN', 32, 'compound')
('in', 'NNP', 34, 'nmod')
(None, None, None, None)
('盛岡', 'NNP', 37, 'obl')
('」', 'SYM', 34, 'punct')
('と', 'PQ', 34, 'case')
('呼ぶ', 'VV', 40, 'acl')
('れる', 'AV', 37, 'aux')
('朗読', 'NN', 40, 'compound')
('会', 'XS', 44, 'nsubj')
('が', 'PS', 40, 'case')
('過去', 'NN', 44, 'iobj')
('に', 'PS', 42, 'case')
('上演', 'VV', 65, 'advcl')
(None, None, None, None)
(None, None, None, None)
('て', 'PC', 44, 'mark')
('おる', 'AV', 44, 'aux')
('、', 'SYM', 44, 'punct')
('高橋', 'NNP', 52, 'nmod')
('の', 'PN', 50, 'case')
('作品', 'NN', 57, 'obl')
('も', 'PK', 52, 'case')
(None, None, None, None)
(None, None, None, None)
('か', 'PF', 59, 'mark')
('上演', 'VV', 65, 'csubj')
('何度', 'NN', 59, 'subtok')
('何度', 'NN', 57, 'obl')
('て', 'PC', 57, 'mark')
(None, None, None, None)
('た', 'AV', 57, 'aux')
('こと', 'PNB', 57, 'mark')
('が', 'PS', 57, 'case')
('ある', 'VV', 65, 'ROOT')
('。', 'SYM', 65, 'punct')

The missing words are "いわて", "題し", "さ", "れ", and "何度" is aligned twice, after the following word.

When I used only the earlier part of the training data (before that sentence), the error did not occur, but the UAS value decreased over 5 epochs.
It looks like there is some confusion in the gold-retokenization procedure.
I'm investigating the cause.

@honnibal (Member) commented:

@hiroshi-matsuda-rit Thanks, it's very possible my alignment code is wrong, as I struggled a little to develop it. You might find it easier to reproduce the problem in a test case if you're calling the alignment function directly. You can find the tests here: https://github.com/explosion/spaCy/blob/master/spacy/tests/test_align.py

It may be that there's a bug which only gets triggered for non-Latin text, as maybe I'm doing something wrong with respect to methods like .lower() or .startswith(). Are you able to develop a test case using Latin characters? If so, that would be much preferable, as I would find it a lot easier to work with.
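
A minimal sketch of such a Latin-character test case, assuming the aligner is exposed as spacy.gold.align (that is where it lives in later v2.x releases; in v2.1 it may sit in an internal module). It mirrors the one-to-many splits seen in the Japanese output above:

from spacy.gold import align

# the tokenizer keeps "supportcenter" whole while the gold data splits it
cand_words = ["art", "supportcenter", "is", "here"]
gold_words = ["art", "support", "center", "is", "here"]

cost, c2g, g2c, c2g_multi, g2c_multi = align(cand_words, gold_words)
print(cost)        # number of misaligned tokens
print(c2g)         # per-candidate mapping; -1 where there is no one-to-one match
print(g2c_multi)   # gold tokens that collapse into a single candidate token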

@hiroshi-matsuda-rit (Contributor) commented:

@honnibal Thanks! I'd like to try writing some tests, including both Latin and non-Latin cases.

@honnibal (Member) commented Jun 11, 2019

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

If 20 hours is enough then I'd be glad to give it a shot! I figured it'd take much longer than that.

@polm : Of course your mileage may vary, but the annotation tool is quite quick, especially if you do the common entities one at a time, and possibly use things like pattern rules. Even a small corpus can start to get useful accuracy, and once the initial model is produced, it can be used to bootstrap.

If you want to try it, send us an email? contact@explosion.ai

@hiroshi-matsuda-rit (Contributor) commented:

@honnibal: I've tested your alignment function for several hours and found that the current implementation works correctly, even for non-Latin characters. The output I showed above is in fact a possible alignment, so there is no need to add test cases to test_align.py after all.
Also, I fixed some bugs in my JSON generation code, but I'm still running into another type of error. I'll report the details tomorrow morning. Thanks again for your suggestions!

@hiroshi-matsuda-rit (Contributor) commented Jun 11, 2019

@honnibal Finally, I've solved all the problems, and the train command now works well with SudachiTokenizer. The accuracy improves as the epochs proceed.

I found three issues that prevented subtok unification from working (a consolidated sketch follows the list):

  1. In spacy.pipeline.functions.merge_subtokens(), we have to merge overlapping spans, keeping only the widest ones:

    spans = [(start, end + 1) for _, start, end in matches]
    widest = []
    for start, end in spans:
        for i, (s, e) in enumerate(widest):
            if start <= s and e <= end:
                del widest[i]
            elif s <= start and end <= e:
                break
        else:
            widest.append((start, end))
    spans = [doc[start:end] for start, end in widest]

  2. spacy.pipeline.functions.merge_subtokens() receives additional arguments from the pipe, so its signature has to accept them:

    def merge_subtokens(doc, label="subtok", batch_size=None, verbose=None):

  3. In GoldParse(), we have to avoid creating a dependency loop while adding subtok arcs, with a condition like the one below (but I'm not sure it is the right condition):

                    if not is_last and i != self.gold_to_cand[heads[i2j_multi[i+1]]]:
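
A consolidated, self-contained sketch of fixes 1 and 2 (fix 3 lives inside GoldParse and isn't shown). This is not the final upstream code; it assumes spacy.util.filter_spans is available (spaCy v2.1.8+), which keeps the longest non-overlapping spans and therefore has the same effect as the widest-span loop in item 1:

from spacy.matcher import Matcher
from spacy.util import filter_spans

def merge_subtokens(doc, label="subtok", **_pipe_kwargs):
    # _pipe_kwargs swallows extra arguments (batch_size, verbose, ...) passed in by the pipe (fix 2)
    matcher = Matcher(doc.vocab)
    matcher.add("SUBTOK", None, [{"DEP": label, "OP": "+"}])
    spans = [doc[start : end + 1] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):  # keep only the widest spans (fix 1)
            retokenizer.merge(span)
    return doc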

@hiroshi-matsuda-rit (Contributor) commented Jun 18, 2019

@honnibal @polm I'd like to report the results of the UD-Japanese meeting held in Kyoto yesterday.
I asked the committee about the two issues below and got useful feedback on publishing a commercially usable dataset.

Q1. Is the UD_Japanese-GSD dataset suitable as training data for the official Japanese model of spaCy?
A1. Probably not.
The GSD dataset is licensed CC-BY-NC-SA. Under Japanese law it is a 'gray' area whether probabilistic models trained on an 'NC' dataset may be used for commercial purposes (though other jurisdictions might allow it).
https://github.com/UniversalDependencies/UD_Japanese-GSD/blob/master/LICENSE.txt

We think it's safer to use UD_Japanese-PUD, which is under a CC-BY-SA license.
https://github.com/UniversalDependencies/UD_Japanese-PUD/blob/master/LICENSE.txt

My opinions:

  • Use the UD_Japanese-PUD dataset instead of GSD for the early releases of the spaCy Japanese language model (PUD is small, but enough to learn basic dependency structure)
  • If publishing the JSON data is not required, we can publish models trained on UD-Japanese BCCWJ as spaCy's Japanese model.
  • We'd like to create a new dataset containing over 10,000 sentences with gold dependency and named-entity annotations, under an OSS license that allows commercial use, by the end of this year

I've published a PUD-based Japanese language model as a preview version. It was trained on only 900 sentences, but the accuracy is not bad.
https://github.com/megagonlabs/ginza/releases/tag/v1.3.1-pud.preview-8
tkn_recall:LAS=0.8736,UAS=0.8930,UPOS=0.9393,boundary=0.9647
precision:LAS=0.8672,UAS=0.8864,UPOS=0.9324,boundary=0.9576

Shall I publish these PUD-based JSON files and send a PR with my custom tokenizer and component?

Q2. Are there any commercially usable Japanese NE datasets?
A2. Unfortunately, KWDLC is the only option.

My opinion:
According to the KWDLC license, we have an obligation to remove sentences whenever a copyright holder requests it. That may cause problems in the future.

I'm going to add NE annotations to UD_Japanese-PUD over the next two weeks.
Can I use Prodigy as the annotation tool for that?

Thanks,

@polm (Contributor, Author) commented Jun 18, 2019

Thanks for the update!

We'd like to create a new dataset containing over 10,000 sentences with gold dependency and named-entity annotations, under an OSS license that allows commercial use, by the end of this year

This is great news!

The GSD dataset is licensed CC-BY-NC-SA. Under Japanese law it is a 'gray' area whether probabilistic models trained on an 'NC' dataset may be used for commercial purposes (though other jurisdictions might allow it).

This is surprising to me - I was under the impression that trained models were fine to distribute after the recent Japanese copyright law changes.

Anyway, I think getting a full pipeline working with data that exists now sounds good, and it's wonderful to hear data with a clear license will be available later this year. If there's anything I can help with please feel free to @ me.

@hiroshi-matsuda-rit (Contributor) commented Jun 18, 2019

@polm Sure. There were big changes to copyright law in Japan at the beginning of this year.

In my humble opinion, the new copyright law allows us to publish datasets extracted from the public web (except ready-made datasets with a no-republish clause), and everyone in Japan can train and use machine learning models on those open datasets for internal use, even for commercial purposes.

But it's still a gray zone to publish models trained on datasets with no-republish or non-commercial-use clauses while granting commercial-use rights in the model's license. This is why using GSD carries risks and why I recommended PUD for the early versions of the spaCy Japanese model.

Thanks a lot!

@hiroshi-matsuda-rit (Contributor) commented Jun 7, 2020

This is a draft of meta.json for the Japanese model. I'd like to dump meta.json with the options json.dump(meta, f, ensure_ascii=False, indent=2) to keep the Japanese text readable (the TAG list also contains Japanese text); see the sketch after the draft.

I confirmed the following contents with the representatives of NINJAL and Works Applications.
I would like to finalize it here.

(Also set vectors.name to "ja_vectors-chive-1.1-mc90-500k".)

{
  "name":"core_web_lg",
  "license":"CC BY-SA 4.0",
  "author":"Explosion and Megagon Labs Tokyo",
  "url":"https://explosion.ai",
  "email":"contact@explosion.ai",
  "description":"Japanese multi-task CNN trained on UD_Japanese-GSD v2.6-NE. Assigns word2vec token vectors, POS tags, dependency parse and named entities.",
  "sources":[
    {
      "name":"UD_Japanese-GSD v2.6",
      "url":"https://github.com/UniversalDependencies/UD_Japanese-GSD",
      "license":"CC BY-SA 4.0"
    },
    {
      "name":"UD_Japanese-GSD v2.6-NE",
      "author":"Megagon Labs Tokyo",
      "url":"https://github.com/megagonlabs/UD_Japanese-GSD",
      "license":"CC BY-SA 4.0",
      "citation":"Matsuda, Omura, Asahara, et al. UD Japanese GSD の再整備と固有表現情報付与. 言語処理学会第26回年次大会発表, 2020"
    },
    {
      "name":"chiVe: Japanese Word Embedding with Sudachi & NWJC",
      "author":"Works Applications",
      "url":"https://github.com/WorksApplications/chiVe",
      "license":"Apache License, Version 2.0",
      "citation":"Manabe, Oka, et al. 複数粒度の分割結果に基づく日本語単語分散表現. 言語処理学会第25回年次大会, 2019"
    },
    {
      "name":"SudachiPy",
      "author":"Works Applications",
      "url":"https://github.com/WorksApplications/SudachiPy",
      "license":"Apache License, Version 2.0",
      "citation": "Takaoka, Hisamoto, et al. Sudachi: a Japanese Tokenizer for Business. LREC, 2018"
    },
    {
      "name":"SudachiDict",
      "author":"Works Applications",
      "url":"https://github.com/WorksApplications/SudachiDict",
      "license":"Apache License, Version 2.0",
      "citation": "Sakamoto, Kawahara, et al. 形態素解析器『Sudachi』のための大規模辞書開発. 言語資源活用ワークショップ, 2018"
    }
  ],
  "spacy_version":">=2.3.0",
  "parent_package":"spacy"
}
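
For reference, a minimal sketch of dumping the draft above with the options mentioned, so the non-ASCII strings stay readable (the file path and the abbreviated dict are illustrative):

import json

meta = {
    "name": "core_web_lg",
    "description": "Japanese multi-task CNN trained on UD_Japanese-GSD v2.6-NE.",
    # ... the rest of the draft shown above ...
}

with open("meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, ensure_ascii=False, indent=2)  # Japanese text is not escaped to \uXXXX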

@hiroshi-matsuda-rit (Contributor) commented Jun 7, 2020

I added a commit to the PR above:
981d077

We're using Tokenizer.SplitMode.A for SudachiPy, but we cannot change it after the Japanese language instance has been created.
We might change the split mode to B or C in future model releases, so I think we need to be able to configure the SudachiPy split mode from the tokenizer entry in meta.json.
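
For context, the configuration shape that #5562 ended up supporting in spaCy v2.3 looks roughly like the sketch below (based on the v2.3 usage docs; check the released documentation for the exact keys):

from spacy.lang.ja import Japanese

# default: SudachiPy split mode A
nlp = Japanese()

# select split mode B (or C) via the tokenizer config in the meta
cfg = {"split_mode": "B"}
nlp = Japanese(meta={"tokenizer": {"config": cfg}})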

@hiroshi-matsuda-rit (Contributor) commented:

I just sent two PRs. Please review again.
#5560
#5561

@hiroshi-matsuda-rit (Contributor) commented:

#5561 was replaced with #5562 and extended to support serializing the split_mode.
Thank you so much! @adrianeboyd

@hiroshi-matsuda-rit (Contributor) commented:

I confirmed the contents of meta.json with the representatives of NINJAL and Works Applications.
I'd like to finalize it.
#3756 (comment)

@adrianeboyd (Contributor) commented:

@hiroshi-matsuda-rit: Thank you for providing the meta information! Our model training setup uses standardized vector names, but I will add the chive version from above to the metadata. The meta.json in the model package is going to be saved in unreadable ASCII, but the information shown in the loaded model as nlp.meta should be readable.

@hiroshi-matsuda-rit (Contributor) commented:

@adrianeboyd Thank you for your cooperation!
I'm very grateful that you set the word vectors name as I mentioned.
I also understand that changes to critical files such as the model's meta.json could break downstream applications, so I agree with your decision not to use multi-byte characters in meta.json.

By the way, I have a suggestion about adding a user_data field.
Users of GiNZA have been asking me how to access SudachiPy's reading_form.
Do you have any plan to add a field containing pronunciations to Token?
Or shall I add a user_data entry for the reading forms?

@hiroshi-matsuda-rit (Contributor) commented:

I found a problem in my model training settings.
We should not use the --learn-tokens option for spacy train, because we put doc.user_data[] fields aligned to the tokens in the doc; it would also reduce accuracy overall.
It was probably obvious to all of you, but I did not understand the negative effects of --learn-tokens.
Sorry about that.

@adrianeboyd (Contributor) commented Jun 10, 2020

No worries, we're not using --learn-tokens in any provided models because it has too many side effects like this. (The main one is that the position of the parser in the pipeline affects the results of the other models, which have been trained on the original segmentation. This would cause many usability / user support headaches, especially since the typical pipeline examples have the parser before NER.)

In terms of the tokenizer, I would caution against adding too many features that would slow down the tokenizer, but if sudachipy is already calculating and providing this information, I think saving it as a custom token extension sounds reasonable. You'd want to run some timing tests to make sure it's not slowing the tokenizer down too much.

I've also removed the sentence segmentation from the tokenizer because it wasn't splitting sentences as intended. The default models will work like other languages in spaCy (where the parser splits sentences), and if you want to provide this functionality as an optional pipeline component like the sentencizer, it can be added in the future.
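
A rough sketch of the kind of timing check mentioned above (the text and document count are placeholders):

import time
import spacy

nlp = spacy.blank("ja")  # tokenizer-only pipeline; compare against a build with the extra features
texts = ["銀座でランチをご一緒しましょう。"] * 2000

start = time.perf_counter()
for _ in nlp.pipe(texts):
    pass
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} docs/sec")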

@hiroshi-matsuda-rit (Contributor) commented Jun 10, 2020

Thanks for the detailed explanation of --learn-tokens.
Now I understand its side effects well.

I agree with removing the sentence splitter from __init__.py (even though I'm not its author) and using parser-based sentence splitting.

In addition, I'm developing a pipeline component that assigns Japanese bunsetu phrase structure to Doc.user_data[], and I'm going to publish it to spaCy Universe as polm-san mentioned before.
However, the current master branch contains its prototype as spacy.lang.ja.bunsetu, maybe unintentionally.
I'd like to remove it in the next PR.

I'm now looking into morphanalysis.py and found some useful properties, like inf_form and verb_form, for storing the inflection type information of Japanese predicates.
Can I use them instead of Doc.user_data[]?
And is there an appropriate field to store the pronunciation string for each token?

@adrianeboyd (Contributor) commented Jun 10, 2020

The MorphAnalysis is going to change a lot for spacy v3, so please don't develop anything for spacy v2 related to this. The morphological analyzer was never fully implemented in spacy v2 and is remaining undocumented / unsupported for v2.3, too. You can access Token.morph for morphological features provided by tag maps, but we're not advertising any of this because the API is going to change for v3.

For spacy v3 we have modified MorphAnalysis to allow arbitrary features, more or less anything that could appear in the UD FEATS column, and there will be a statistical morphologizer model that predicts POS+FEATS. If you do want to get started now, please have a look at the develop branch for the new API. It is still a bit unstable and the recent Japanese language changes haven't been merged there yet, but we'll switch all development to v3 shortly after v2.3.0 is released and get that updated soon.

There's no field for storing pronunciations, so a custom extension is probably a good place for this for now.

@hiroshi-matsuda-rit (Contributor) commented:

I see. For now I'm going to add doc.user_data['reading_forms'] to store the pronunciations, and I'll read up on the new v3.0 morphology features in develop.
Thank you again!
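
A minimal sketch of how that user_data entry could later be surfaced as a custom Token extension (the extension name is hypothetical, and it assumes the readings are stored as a plain per-token list under the key mentioned above):

from spacy.tokens import Token

def get_reading(token):
    readings = token.doc.user_data.get("reading_forms")
    return readings[token.i] if readings else None

Token.set_extension("reading_form", getter=get_reading, force=True)
# usage: token._.reading_form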

@adrianeboyd (Contributor) commented:

If anyone would like to test the upcoming Japanese models, the initial models have been published and can be tested with spacy v2.3.0.dev1:

pip install spacy==2.3.0.dev1
pip install https://github.com/explosion/spacy-models/releases/download/ja_core_news_sm-2.3.0/ja_core_news_sm-2.3.0.tar.gz --no-deps

Replace sm with md or lg for models with vectors.

@HiromuHota (Contributor) commented:

I quickly tested dev1, but had to install sudachidict_core too. So the installation commands would be:

pip install spacy==2.3.0.dev1
pip install https://github.com/explosion/spacy-models/releases/download/ja_core_news_sm-2.3.0/ja_core_news_sm-2.3.0.tar.gz --no-deps
pip install sudachidict_core

@polm (Contributor, Author) commented Jun 10, 2020

@HiromuHota I think the necessary Sudachi packages, including the dictionary, should be installed if you specify spacy[ja].


I only did a very cursory check but output looks OK so far.
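
For anyone else testing, the sort of cursory check referred to above can be as simple as the following (output is just eyeballed, not asserted; the example sentence is arbitrary):

import spacy

nlp = spacy.load("ja_core_news_sm")
doc = nlp("銀座でランチをご一緒しましょう。")
print([(t.text, t.lemma_, t.pos_, t.dep_) for t in doc])
print([(ent.text, ent.label_) for ent in doc.ents])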

@HiromuHota (Contributor) commented:

@polm I confirmed that it worked. So it should be

pip install spacy[ja]==2.3.0.dev1
pip install https://github.com/explosion/spacy-models/releases/download/ja_core_news_sm-2.3.0/ja_core_news_sm-2.3.0.tar.gz --no-deps

@hiroshi-matsuda-rit (Contributor) commented:

I've finished adding the new doc.user_data[] features and test cases.
https://github.com/hiroshi-matsuda-rit/spaCy/compare/master...hiroshi-matsuda-rit:feature/japanese/reading_forms_and_refactors2?expand=1

But py.test never runs the new test cases in my environment. Uh... please help me.

$ py.test -k "sub_tokens" spacy/tests/lang/ja/test_tokenizer.py
========================================================================================================= test session starts ==========================================================================================================
platform linux -- Python 3.8.3, pytest-5.4.3, py-1.8.1, pluggy-0.13.1
rootdir: /mnt/c/git/spaCy, inifile: setup.cfg
plugins: timeout-1.3.4
collected 103 items / 102 deselected / 1 selected                                                                                                                                                                                      

spacy/tests/lang/ja/test_tokenizer.py s                                                                                                                                                                                          [100%]

================================================================================================== 1 skipped, 102 deselected in 0.11s ==================================================================================================

@HiromuHota (Contributor) commented:

@hiroshi-matsuda-rit Check if LANGUAGES in spacy/tests/lang/test_initialize.py includes "ja". Hope this helps.

@hiroshi-matsuda-rit (Contributor) commented:

@HiromuHota Thank you! I added ja to test_initialize.py, but the ja/* tests are all still skipped. What am I doing wrong?

$ py.test spacy/tests/lang
...
spacy/tests/lang/it/test_prefix_suffix_infix.py ..
spacy/tests/lang/ja/test_lemmatization.py sssss
spacy/tests/lang/ja/test_serialize.py s
spacy/tests/lang/ja/test_tokenizer.py sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
spacy/tests/lang/ko/test_lemmatization.py sssss
spacy/tests/lang/ko/test_tokenizer.py sssssssss
spacy/tests/lang/lb/test_exceptions.py .......
...

@HiromuHota (Contributor) commented Jun 10, 2020

I tested it myself:

$ python -m pytest spacy/tests/lang/ja/test_tokenizer.py -v -rsXx -s -x
======================================================================================== short test summary info ========================================================================================
SKIPPED [99] /Users/hiromu/workspace/spaCy/spacy/tests/conftest.py:143: could not import 'fugashi': No module named 'fugashi'
SKIPPED [2] spacy/tests/lang/ja/test_tokenizer.py:61: sentence segmentation in tokenizer is buggy

Sorry, LANGUAGES had nothing to do with it. I just installed fugashi and the tests ran.

I think pytest.importorskip("fugashi") should now be removed, like below.

$ git diff
diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py
index 63bbf2e0a..ef776a4f2 100644
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -140,7 +140,6 @@ def it_tokenizer():
 
 @pytest.fixture(scope="session")
 def ja_tokenizer():
-    pytest.importorskip("fugashi")
     return get_lang_class("ja").Defaults.create_tokenizer()

@hiroshi-matsuda-rit (Contributor) commented:

@HiromuHota You are using the old master branch.
fugashi was replaced with SudachiPy there.
Could you use the feature branch of my repo?
https://github.com/hiroshi-matsuda-rit/spaCy/tree/feature/japanese/reading_forms_and_refactors2

@HiromuHota (Contributor) commented:

Even on your feature branch, pytest.importorskip("fugashi") is still there at https://github.com/hiroshi-matsuda-rit/spaCy/blob/feature/japanese/reading_forms_and_refactors2/spacy/tests/conftest.py#L143.
That line should be removed now that fugashi has been replaced with sudachipy, or it should become pytest.importorskip("sudachipy") or something like that.
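
That is, the fixture in conftest.py would become something like:

import pytest
from spacy.util import get_lang_class

@pytest.fixture(scope="session")
def ja_tokenizer():
    pytest.importorskip("sudachipy")
    return get_lang_class("ja").Defaults.create_tokenizer()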

@hiroshi-matsuda-rit (Contributor) commented:

@HiromuHota Thanks a lot! All the spacy.lang.ja test cases were executed after I replaced "fugashi" with "sudachipy" in conftest.py.

@hiroshi-matsuda-rit (Contributor) commented:

I just submitted a PR that lets users get reading_forms, inflections, and sub_tokens from Doc.user_data[]:
#5573

@hiroshi-matsuda-rit (Contributor) commented:

I reported accuracy regressions in SudachiPy versions 0.4.6–0.4.7.
They have fixed the problems and already released v0.4.8.
We should specify sudachipy>=0.4.8 when releasing the new Japanese models.

@svlandeg (Member) commented:

I think this issue can be closed, as the first Japanese models were published with spaCy 2.3.2 and this thread seems to have gone quiet ;-)

If there are further problems or discussion needed with the Japanese models - feel free to open a new issue.

Huge thanks for everyone involved in this effort!

@github-actions (bot) commented:

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked this issue as resolved and limited the conversation to collaborators on Oct 31, 2021.