resolved token vs. character indexing issues #15

davidberenstein1957 · 2022-10-01T19:12:19Z

Added character indexing to replace the faulty token-based indexing. Note that doc[index_x:index_y] slices by token, whereas entity-fishing works with character offsets.
Additionally, the current code assigns the entity-fishing variables to completely different spans, which are also likely to overlap with one another.
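
To make the mismatch concrete, here is a minimal pure-Python sketch. It is not the project's code: a whitespace split stands in for spaCy's tokenizer, and the sentence and the offsets (0, 5) are made up for illustration rather than taken from real entity-fishing output.

```python
text = "Paris is the capital of France."

# Stand-in tokenizer (assumption): split on whitespace instead of using spaCy.
tokens = text.split()

# Hypothetical character offsets for the mention "Paris".
start, end = 0, 5

# Interpreted as character offsets, the numbers select the mention itself.
char_mention = text[start:end]

# Interpreted as token indices, the same numbers select the first five tokens.
token_mention = " ".join(tokens[start:end])

print(char_mention)   # Paris
print(token_mention)  # Paris is the capital of
```

The same pair of numbers yields two different spans, which is why character offsets cannot be passed directly to a token-level slice.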

davidberenstein1957 · 2022-10-01T19:13:31Z

@Lucaterre I also wrote some code to include entities and Wikipedia matches from the entity-fishing API. Since this is a more elaborate feature, I would like your input before creating a PR.

davidberenstein1957 · 2022-10-01T20:06:01Z

Also, the pipeline currently isn't optimized for spaCy's .pipe function. I can help add this too.

Lucaterre · 2022-10-03T10:44:46Z

Hi @davidberenstein1957, thank you very much for this bug fix and for your contributions. 👍

You are right: entity-fishing uses offsets at the character level, not at the token level. This was a mistake on my part. 😞

However, your implementation does not fully work: if you test it on the examples in the docs, the kb_qid attribute returns None.

I need to add a few changes so that spaCy uses the offsets correctly to recover the spans and update the entities in the final doc output (replacing index access, e.g. doc[start:end], with the doc.char_span() method), and I will update the nil_clustering part too. I'll take care of that after checking the tests and merging this PR.
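
As a rough illustration of the doc.char_span() approach mentioned above (a sketch, not the code that was committed), the snippet below uses a blank English pipeline so that no trained model is required. The sentence and the character offsets are hypothetical, chosen by hand rather than returned by entity-fishing.

```python
import spacy

# Blank pipeline: tokenizer only, no trained model to download.
nlp = spacy.blank("en")
doc = nlp("Paris is the capital of France.")

# Hypothetical character offsets for the mention "France".
start_char, end_char = 24, 30

# char_span maps character offsets onto token boundaries and returns a Span.
span = doc.char_span(start_char, end_char)

# With the default strict alignment, offsets that fall inside a token give None.
misaligned = doc.char_span(24, 28)

# alignment_mode="expand" widens the span to cover the partially matched token.
expanded = doc.char_span(24, 28, alignment_mode="expand")

print(span.text)      # France
print(misaligned)     # None
print(expanded.text)  # France
```

Because char_span can return None, any code that reads attributes from the result needs to check for that case first.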

I will create a new release after this, because it is an important fix. Thanks again 😄

davidberenstein1957 · 2022-10-03T10:53:13Z

@Lucaterre could you please add your repo to Hacktoberfest? https://hacktoberfest.com/participation/#pr-mr-details 🤓
I want to get a tree planted in my honor.

resolved token vs. character indexing issues

9c4775c

Lucaterre added the 🐛 bug and ⏫ high priority labels Oct 3, 2022

Lucaterre merged commit 408cff7 into Lucaterre:main Oct 3, 2022

Lucaterre added a commit that referenced this pull request Oct 3, 2022

Update spans access in doc with correct offsets and char_span method - …

68cf3e4

…#15

Lucaterre mentioned this pull request Oct 3, 2022

Davidberenstein1957 batch support #16

Merged

Lucaterre added the hacktoberfest and hacktoberfest-accepted labels Oct 3, 2022
