Span Finder Suggester #10
Very nice! This is going to be annoying, but I think that we need to avoid the abbreviation |
@adrianeboyd how about "span_bd" instead of "sbd"? 😄
This is great Edi!
From a practical point of view, I'm wondering whether we need all the different config files in all the different subdirectories. It's great to support various datasets, but what if we could make the decision very early on:
- You process one of the three available datasets with one of the three specific commands (cf. below)
- The processed .spacy output is stored in the same subdir of the project, regardless of which dataset was processed
- Now, all subsequent steps don't have to bother about where the data came from originally
Is that feasible? Or are there things you're doing differently in the scripts depending on which dataset it was (besides preprocessing)?
Alright, I've renamed every reference to I've also adjusted the spaCy project, added more commands, and reduced the number of configs, so that now there is only one config for all datasets. |
As the next minor adjustment, I think that |
…omashacker/spacy-experimental into feature/spanboundarydetection
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Can you rename:
In the project, I'd also like to have |
I'm not sure I found everything, but can you do another pass through the docs, project, and tests?
The remaining issues:
Updated the key in the suggester and looked over the docs and the project. I also checked the spaCy project and ran the workflows with different configurations.
This PR adds a new experimental component for learning span boundaries and a custom suggester function for spancat. It also adds a spaCy project showcasing how to use the SpanFinder component on three datasets (Healthsea, ToxicSpans, Genia) with two configurations (tok2vec and transformer). The project can also train spancat with the ngram suggester and compare it to SpanFinder, using a custom evaluation script that calculates the performance and overall coverage of the suggester functions.
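To make "coverage of a suggester" concrete, here is a minimal sketch in plain Python. It is not the evaluation script from this PR: the helper names `ngram_candidates` and `coverage`, the example sentence, and the gold offsets are all made up for illustration, and coverage is assumed to mean the fraction of gold spans that appear among the suggested candidates.

```python
from typing import List, Set, Tuple

# A span is (start_token, end_token) with an exclusive end, like Doc[start:end].
Span = Tuple[int, int]


def ngram_candidates(n_tokens: int, sizes: List[int]) -> Set[Span]:
    """Hypothetical helper: every contiguous token window of the given sizes."""
    return {
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    }


def coverage(gold: Set[Span], candidates: Set[Span]) -> float:
    """Assumed definition: share of gold spans that the suggester proposed."""
    if not gold:
        return 0.0
    return len(gold & candidates) / len(gold)


# Illustrative sentence and gold spans (not taken from any of the datasets).
tokens = "the patient reported severe joint pain after treatment".split()
gold = {(3, 6), (7, 8)}  # "severe joint pain", "treatment"

# With sizes 1 and 2 the three-token gold span can never be suggested.
candidates = ngram_candidates(len(tokens), sizes=[1, 2])
print(len(candidates), coverage(gold, candidates))
```

Raising the n-gram sizes closes the coverage gap but grows the candidate set that spancat has to score, which is the trade-off a learned suggester is meant to ease.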
Features