Japanese Model #3756
Hi! I have the same opinion. I think there is a lot of demand for integrating a Japanese model into spaCy. In fact, I'm preparing to integrate GiNZA into spaCy v2.1 and am going to create some pull requests for this integration, with a parser model trained on the (formally licensed) UD-Japanese BCCWJ and a NER model trained on the Kyoto University Web Document Leads Corpus (KWDLC), by the middle of June. Unfortunately, the UD-Japanese GSD corpus is effectively obsolete after UD v2.4 and will not be maintained by the UD community anymore. They have been focusing on developing the BCCWJ version since v2.3. Our lab (Megagon Labs) has a mission to develop and publish models trained on UD-Japanese BCCWJ together with the licensor (NINJAL). Because of licensing problems (mostly with the newspaper text) and the differences among the word segmentation policies that come with the POS systems, it is difficult to train and publish commercial-use NER models. But there is a good solution, which I've used for GiNZA. The Kyoto University Web Document Leads Corpus has high-quality gold-standard annotations containing NER spans with popular NER categories. I used the part of KWDLC's NER annotations that matches the boundaries of the Sudachi morphological analyzer (mode C) to train our models. What do you think of this approach? @polm
If you're able to prepare the dataset files, I'd definitely be glad to add a Japanese model to the training pipeline. For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?
@hiroshi-matsuda-rit I think that sounds great! Let me know if there's anything I can help with.
I hadn't realized this; that's really unfortunate.
I was not aware this had good NER annotations, so that sounds like a great option. My only concern is that the license is somewhat unusual, so I'm not sure how the spaCy team or downstream users would feel about using a model based on the data. The thing I thought might be a problem with BCCWJ and similar data is that, based on #3056, my understanding is that the spaCy maintainers want to have access to the training data, not just compiled models, so that the models can be added to the training pipeline and updated whenever spaCy's code is updated (see here). @honnibal Are you open to adding data to the training pipeline that requires a paid license, like the BCCWJ? I didn't see a mention of any cases like that in #3056 so I wasn't sure...
That makes sense, though it seems easier to use the Crawl for word vectors. Is there a good non-manual way to make NER training data from the Common Crawl? I thought of starting with Wikipedia since the approach in Nothman et al (which I think was/is used for some spaCy training data?) seemed relatively quick to implement; I've posted a little progress on that here. I guess with the Common Crawl a similar strategy could be used, looking for links to Wikipedia or something? With a quick search all I found was DepCC, which seems to just use the output of the Stanford NER models for NER data.
The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.
We're open to buying licenses to corpora, sure, so long as once we buy the license we can train and distribute models for free. We've actually paid quite a lot in license fees for the English and German models. If the license would require users to buy the corpus, it's still a maybe, but I'm less enthusiastic.
If 20 hours is enough then I'd be glad to give it a shot! I figured it'd take much longer than that.
Sure. I think it would be better if the training datasets were published under an OSS license, even if it's only a small part of BCCWJ. I'm going to raise this dataset-publishing question with the licensor. Please wait just a few days.
If we can use Common Crawl text without worrying about license problems, I think KWDLC would be a good option for the NER training corpus, because KWDLC's annotation data has no restrictions on commercial use and its entire text was retrieved from the web in much the same way as the Common Crawl.
I'm very happy to hear that and would appreciate your help. Thanks a lot! Please review my branches later.
Kanayama-san, one of the founding members of the UD-Japanese community, updated the UD-Japanese GSD master branch a few weeks ago.
I think we should use datasets with a well-known license, such as MIT or GPL, to train the language models which spaCy supports officially. Thanks!
@polm @honnibal Good news for us! I'd like to prepare a new spaCy model trained on UD-Japanese GSD (parser) and KWDLC (NER) in a few days. I'll use GiNZA's logic, but it will be much smarter than the previous version of GiNZA. I'm refactoring the pipeline (I dropped the dummy root token).
@polm @honnibal https://github.com/megagonlabs/ginza/releases/tag/v1.1.0-gsd_preview-1 I used the latest UD-Japanese GSD dataset and KWDLC to train that model. Parsing accuracy (using a separate test dataset):
NER accuracy (using the same dataset for both training and test, sorry):
I'm still refactoring the source code to resolve the two important issues below:
I've finished the refactoring to use entry points. I'll add one more improvement to the accuracy of the root token's POS tonight.
I've implemented rule-based logic for "Sa-hen" noun POS disambiguation and released the final alpha version: https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3 I'm going to add some documentation and tests, and then create a PR to integrate the new ja language class and models into the spaCy master branch. Please review the code above and give me some feedback.
Thank you, @hiroshi-matsuda-rit san, for your work on https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3 but I could not find
I'm so sorry for that. I added it. Please try downloading again. @KoichiYasuoka
Thank you again, @hiroshi-matsuda-rit san, and I've checked some attributes in Token https://spacy.io/api/token as follows:
and I found several
I previously tried to store detailed part-of-speech information in tag_, but could not do that without modifying the data structure of the spaCy token, and unfortunately that modification needs C-level recompilation and reduces compatibility with other languages. I'd like to read spaCy's source code again and report whether we can store a language-specific tag in token.tag_ or not. @KoichiYasuoka
I've found that recent spaCy versions allow changing token.pos_.
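For readers following along, here is a minimal sketch of that idea as a spaCy v2-style pipeline component. The tag check and the sample text are illustrative assumptions, not GiNZA's actual "Sa-hen" rules, and it assumes the Japanese tokenizer dependencies are installed:

```python
from spacy.lang.ja import Japanese  # needs the Japanese tokenizer dependencies installed

def correct_sahen_pos(doc):
    # Hypothetical rule: a noun tagged as "Sa-hen possible" followed by する
    # is treated as a verb. GiNZA's real disambiguation rules are more involved.
    for token in doc[:-1]:
        if "サ変可能" in token.tag_ and doc[token.i + 1].lemma_ == "する":
            token.pos_ = "VERB"  # pos_ is writable in recent spaCy versions
    return doc

nlp = Japanese()
nlp.add_pipe(correct_sahen_pos, name="sahen_pos", last=True)
doc = nlp("毎日勉強する")
print([(t.text, t.tag_, t.pos_) for t in doc])
```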
Thank you again and again, @hiroshi-matsuda-rit san, I'm just trying
The same goes for 「猫である」, 「人である」, etc. Then 「である」 seems to have a wrong key for
One more thing, @hiroshi-matsuda-rit san: what do you think about changing
@KoichiYasuoka Thank you! Actually, I've improved GiNZA's code for the problems you reported above, and have just started refactoring the training procedure.
@honnibal I'd like to move the discussion from #3818 to here because this is a Japanese-model-specific issue. I'm trying to use spaCy's train command with SudachiTokenizer to create a Japanese model, as I mentioned in #3818.
I added some probe code to arc_eager.set_cost() and found that fragmented content might be created by the Levenshtein alignment procedure.
The printed content for the sentence "高橋の地元、盛岡にある「いわてアートサポートセンター」にある風のスタジオでは、地域文化芸術振興プランと題して「語りの芸術祭inいわて盛岡」と呼ばれる朗読会が過去に上演されており、高橋の作品も何度か上演されていたことがある。" is:
The missing words are "いわて", "題し", "さ", and "れ", and "何度" is aligned twice after the following word. When I used only the earlier part of the training dataset (before that sentence), the error did not occur, but the UAS value decreased over 5 epochs.
@hiroshi-matsuda-rit Thanks, it's very possible my alignment code could be wrong, as I struggled a little to develop it. You might find it easier to develop a test case with the problem if you're calling it directly. You can find the tests here: https://github.com/explosion/spaCy/blob/master/spacy/tests/test_align.py It may be that there's a bug which only gets triggered for non-Latin text, as maybe I'm doing something wrong with respect to methods like
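As a rough sketch of calling the alignment helper directly, assuming the spaCy 2.2+ location spacy.gold.align (in 2.1 the function lives in an internal module, so the import may need adjusting):

```python
from spacy.gold import align

# Two tokenizations of the same underlying string (whitespace-free Japanese).
sudachi_tokens = ["いわて", "アート", "サポート", "センター", "に", "ある"]
gold_tokens = ["いわてアートサポートセンター", "に", "ある"]

cost, a2b, b2a, a2b_multi, b2a_multi = align(sudachi_tokens, gold_tokens)
print(cost)        # edit cost of the alignment
print(list(a2b))   # for each sudachi token, the aligned gold index (-1 if not one-to-one)
print(a2b_multi)   # many-to-one mappings, e.g. the pieces of the long gold token
```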
@honnibal Thanks! I'd like to try making some tests including both Latin and non-Latin cases.
@polm : Of course your mileage may vary, but the annotation tool is quite quick, especially if you do the common entities one at a time, and possibly use things like pattern rules. Even a small corpus can start to get useful accuracy, and once the initial model is produced, it can be used to bootstrap. If you want to try it, send us an email? contact@explosion.ai
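As a rough illustration of the pattern-rule idea (not Prodigy itself), seed patterns can be expressed with spaCy's EntityRuler, which uses the same pattern format Prodigy accepts; the entity strings below are just made-up examples:

```python
from spacy.lang.ja import Japanese
from spacy.pipeline import EntityRuler

nlp = Japanese()  # assumes the Japanese tokenizer dependencies are installed
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "GPE", "pattern": "盛岡"},                               # phrase pattern
    {"label": "ORG", "pattern": [{"TEXT": "京都"}, {"TEXT": "大学"}]},  # token pattern
])
nlp.add_pipe(ruler)

doc = nlp("盛岡と京都大学の話")
print([(ent.text, ent.label_) for ent in doc.ents])
```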
@honnibal : I've tested your alignment function for several hours and found that the current implementation works correctly even for non-Latin characters. The result I showed above was in fact a possible alignment, so there is no need to add the test cases to test_align.py anymore.
@honnibal Finally, I've solved all the problems, and the train command now works well with SudachiTokenizer. The accuracy improves as the epochs proceed. I found three issues which prevent executing subtok unification.
@honnibal @polm I'd like to report the results of the UD-Japanese meeting held in Kyoto yesterday.
Q1. Is the UD_Japanese-GSD dataset suitable as training data for the official Japanese model of spaCy?
We think it's safer to use UD_Japanese-PUD, which is under a CC-BY-SA license. My opinions:
I've published a PUD-based Japanese language model as a preview version. It was trained on only 900 sentences, but the accuracy is not so bad. Shall I publish these PUD-based JSON files and send a PR with my custom tokenizer and component?
Q2. Are there any commercially usable Japanese NE datasets?
My opinions: I'm going to make NE annotations over UD_Japanese-PUD within two weeks. Thanks,
Thanks for the update!
This is great news!
This is surprising to me - I was under the impression that trained models were fine to distribute after the recent Japanese copyright law changes. Anyway, I think getting a full pipeline working with data that exists now sounds good, and it's wonderful to hear data with a clear license will be available later this year. If there's anything I can help with please feel free to @ me.
@polm Sure. We had big changes in the copyright law in Japan at the beginning of this year. In my humble opinion, the new copyright law allows us to publish datasets extracted from the public web (except ready-made datasets with a no-republish clause), and everyone in Japan can train and use machine learning models for internal use with those open datasets, even for commercial purposes. But it's still a gray zone if we publish models trained on datasets with no-republish or non-commercial-use clauses while granting commercial-use capability in the license of the models. This is why using GSD carries risks, and why I recommended using PUD for the early versions of the spaCy Japanese model. Thanks a lot!
This is the draft. I confirmed the following contents with the representatives of NINJAL and Works Applications. (Also set
I added a commit to the PR above. We're using Tokenizer.SplitMode.A for SudachiPy, but we cannot change it after creating the Japanese language instance.
#5561 was replaced by #5562 and extended to be able to serialize the split_mode.
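If I read #5562 correctly, the split mode becomes a tokenizer config option on the Japanese language class; a small sketch of how that would be used (treat the exact meta keys as my reading of the PR, not confirmed API):

```python
from spacy.lang.ja import Japanese

nlp_a = Japanese()  # SudachiPy split mode A (default)
nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})

text = "選挙管理委員会"
print([t.text for t in nlp_a(text)])  # shorter units
print([t.text for t in nlp_c(text)])  # longer units
```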
I confirmed the contents of
@hiroshi-matsuda-rit: Thank you for providing the meta information! Our model training setup uses standardized vector names, but I will add the chive version from above to the metadata. The
@adrianeboyd Thank you for your cooperation! By the way, I have a suggestion about adding a user_data field.
I found a problem in my model training settings.
No worries, we're not using it. In terms of the tokenizer, I would caution against adding too many features that would slow down the tokenizer, but if sudachipy is already calculating and providing this information, I think saving it as a custom token extension sounds reasonable. You'd want to run some timing tests to make sure it's not slowing the tokenizer down too much. I've also removed the sentence segmentation from the tokenizer because it wasn't splitting sentences as intended. The default models will work like other languages in spaCy (where the parser splits sentences), and if you want to provide this functionality as an optional pipeline component like
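A minimal sketch of the custom-extension-plus-timing idea mentioned above; the extension name and sample text are arbitrary placeholders:

```python
import timeit
from spacy.tokens import Token
from spacy.lang.ja import Japanese

# Register a slot for the Sudachi reading form (the name is hypothetical).
Token.set_extension("reading_form", default=None, force=True)

nlp = Japanese()  # assumes the Japanese tokenizer dependencies are installed
text = "銀座でランチをご一緒しましょう。" * 50

# Rough timing check: compare before/after wiring the extra data into the tokenizer.
print(timeit.timeit(lambda: nlp.make_doc(text), number=100))
```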
Thanks for the detailed description of --learn-token. I agree with removing the sentence splitter. In addition, I'm developing a pipeline component which assigns Japanese bunsetu phrase structure to Doc.user_data[], and I'm going to deploy it to the spaCy Universe as polm-san mentioned before. Now I'm investigating morphanalysis.py and have found some useful properties, like inf_form and verb_form, to store the inflection type information of Japanese predicates.
For spaCy v3 we have modified this area, but there's no field for storing pronunciations, so a custom extension is probably a good place for this for now.
I see. I'm going to add doc.user_data['reading_forms'] to store the pronunciations for now, and I want to read up on the new morphology features of v3.0 in the develop branch.
If anyone would like to test the upcoming Japanese models, the initial models have been published and can be tested with spacy v2.3.0.dev1:
Replace
I quickly tested the dev1 but had to install it as follows:
pip install spacy==2.3.0.dev1
pip install https://github.com/explosion/spacy-models/releases/download/ja_core_news_sm-2.3.0/ja_core_news_sm-2.3.0.tar.gz --no-deps
pip install sudachidict_core
@HiromuHota I think the necessary Sudachi packages, including the dictionary, should be installed if you specify spacy[ja]. I only did a very cursory check, but the output looks OK so far.
@polm I confirmed that it worked. So it should be:
pip install spacy[ja]==2.3.0.dev1
pip install https://github.com/explosion/spacy-models/releases/download/ja_core_news_sm-2.3.0/ja_core_news_sm-2.3.0.tar.gz --no-deps
I've finished adding the new features to doc.user_data[] and the test cases. But
@hiroshi-matsuda-rit Check if
@HiromuHota Thank you! I added ja to test_initialize.py, but the ja/* tests are still all skipped. What am I doing wrong?
I tested it myself
Sorry the I think
@HiromuHota You are using an old master branch.
Even on your feature branch,
@HiromuHota Thanks a lot! All the test cases for spacy.lang.ja were executed after I replaced "fugashi" with "sudachipy" in conftest.
I just submitted a PR which enables users to get reading_forms, inflections, and sub_tokens from Doc.user_data[].
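Assuming the PR lands as described, reading the extra fields would look roughly like this; the exact key names and value layout are whatever the merged PR defines, so treat these as placeholders:

```python
import spacy

nlp = spacy.load("ja_core_news_sm")
doc = nlp("走って帰りました。")

# Keys as named in the comment above; confirm against the merged PR.
print(doc.user_data.get("reading_forms"))
print(doc.user_data.get("inflections"))
print(doc.user_data.get("sub_tokens"))
```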
I reported the accuracy degradation with SudachiPy versions 0.4.6-0.4.7.
I think this issue can be closed, as the first Japanese models were published with spaCy 2.3.2 and this thread seems to have gone quiet ;-) If there are further problems or discussion needed with the Japanese models, feel free to open a new issue. Huge thanks to everyone involved in this effort!
Feature description
I'd like to add a Japanese model to spaCy. (Let me know if this should be discussed in #3056 instead - I thought it best to just tag it in for now.)
The Ginza project exists, but currently it's a repackaging of spaCy rather than a model to use with normal spaCy, and I think some of the resources it uses may be tricky to integrate from a licensing perspective.
My understanding is that the main parts of a model now are 1. the dependency model, 2. NER, and 3. word vectors. Notes on each of those:
Dependencies. For dependency info we can use UD Japanese GSD. UD BCCWJ is bigger but the corpus has licensing issues. GSD is rather small but probably enough to be usable (8k sentences). I have trained it with spaCy and there were no conversion issues.
NER. I don't know of a good dataset for this; Christopher Manning mentioned the same problem two years ago. I guess I could make one based on Wikipedia - I think some other spaCy models use data produced by Nothman et al's method, which skipped Japanese to avoid dealing with segmentation, so that might be one approach. (A reasonable question here is: what do people use for NER in Japanese? Most tokenizer dictionaries, including Unidic, have entity-like information and make it easy to add your own entries, so that's probably the most common approach.)
Vectors. Using JA Wikipedia is no problem. I haven't worked with the Common Crawl before and I'm not sure I have the hardware for it, but if I could get some help on it that's also an option.
So, how does that sound? If there are no issues with that, I'll look into creating an NER dataset.