Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolved token vs. character indexing issues #15

Merged
merged 1 commit into from
Oct 3, 2022
Merged

resolved token vs. character indexing issues #15

merged 1 commit into from
Oct 3, 2022

Conversation

davidberenstein1957
Copy link
Contributor

@davidberenstein1957 davidberenstein1957 commented Oct 1, 2022

Added character indexing, instead of the faulty, token based indexing. Note that doc.[index_x, index_y] refers to token, however, entity-fishing works with characters.
Additionally, now you also assign the entity_fishing variables to completely different spans, which will also likely overlap with one another.

Added character indexing, instead of the faulty, token based indexing.
Note that doc.[index_x, index_y] refers to token, however, entity-fishing works with characters.
@davidberenstein1957
Copy link
Contributor Author

@Lucaterre I also wrote some code, to be able to include entities, and Wikipedia matches from the entity-fishing API too, however, since this is a more elaborate feature, I want to get your input on this before creating a PR.

@davidberenstein1957
Copy link
Contributor Author

Also, currently the pipeline isn´t optimized for using the .pipe function in spaCy. I can help to add this too.

@Lucaterre Lucaterre added 🐛 bug Something isn't working ⏫ high priority labels Oct 3, 2022
@Lucaterre
Copy link
Owner

Lucaterre commented Oct 3, 2022

Hi @davidberenstein1957, thank you very much for this bug fix and for your contributions. 👍

I note that you are right: entity fishing use offsets at character level and not at token level, this is a mistake on my part. 😞

However, your implementation does not fully work (if you test it on the examples in the doc, attribute kb_qid returns None).

I need to added some complements so that Spacy uses the offsets well to recover the spans and update correctly entities in final doc output (replace index access eg. doc[start:end] with doc.char_span() method) and I update the nil_clustering part too. I'll take care of that, after tests checking and merging this PR.

I will create a new release after this, because it is an important fix. thanks again 😄

@Lucaterre Lucaterre merged commit 408cff7 into Lucaterre:main Oct 3, 2022
@davidberenstein1957
Copy link
Contributor Author

@Lucaterre https://hacktoberfest.com/participation/#pr-mr-details could you add your repo to Hacktoberfest please🤓
I want to get a tree planted in my honor.

@Lucaterre Lucaterre added hacktoberfest Issue/PR marked as suitable for Hacktoberfest hacktoberfest-accepted labels Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⏫ high priority 🐛 bug Something isn't working hacktoberfest Issue/PR marked as suitable for Hacktoberfest hacktoberfest-accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants