Fix use_gold_ents behaviour for EntityLinker #13400
Conversation
spacy/pipeline/entity_linker.py
Outdated
```python
def _score_augmented(examples, **kwargs):
    # Because of how spaCy works, we can't just score immediately, because Language.evaluate
    # calls pipe() on the predicted docs, which won't have entities if there is no NER in the pipeline.
    if not self.use_gold_ents:
        return scorer(examples, **kwargs)
    else:
        examples = self._augment_examples(examples)
        docs = self.pipe(
            (eg.predicted for eg in examples),
        )
        for eg, doc in zip(examples, docs):
            eg.predicted = doc
        return scorer(examples, **kwargs)

self.scorer = _score_augmented
```
This whole bit is admittedly pretty hacky, but considering bug 3 as explained in the PR, I don't see a better option short of changing the entire mechanism of how evaluation/scoring of a pipeline works...
Agreed, this is not really satisfying. The workaround makes sense in this context though.
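The shape of the workaround can be illustrated outside spaCy with a minimal, self-contained sketch. Everything here is a simplified stand-in for illustration only: the `Example` class, `base_scorer`, `augment`, and `pipe` are invented mocks, not spaCy's API; only the wrapping pattern mirrors the snippet above.

```python
# Minimal stand-in illustrating the scorer-wrapping pattern: re-run the
# component on the predicted docs before scoring, because the docs handed
# in may lack entities when no NER ran earlier in the pipeline.

class Example:
    def __init__(self, predicted, reference):
        self.predicted = predicted  # doc as the pipeline sees it
        self.reference = reference  # gold annotation

def base_scorer(examples, **kwargs):
    # Toy score: fraction of examples whose prediction matches gold.
    matches = sum(1 for eg in examples if eg.predicted == eg.reference)
    return {"accuracy": matches / len(examples)}

def make_augmented_scorer(scorer, use_gold_ents, augment, pipe):
    def _score_augmented(examples, **kwargs):
        if not use_gold_ents:
            return scorer(examples, **kwargs)
        # Re-predict on (augmented) docs, then score the fresh predictions.
        examples = augment(examples)
        docs = pipe(eg.predicted for eg in examples)
        for eg, doc in zip(examples, docs):
            eg.predicted = doc
        return scorer(examples, **kwargs)
    return _score_augmented

# Toy "component": uppercasing stands in for running self.pipe().
augment = lambda examples: examples
pipe = lambda docs: [doc.upper() for doc in docs]

score = make_augmented_scorer(base_scorer, True, augment, pipe)
examples = [Example("emerson", "EMERSON"), Example("obama", "OBAMA")]
print(score(examples))  # -> {'accuracy': 1.0}
```

Without the wrapper (or with `use_gold_ents=False`), the same examples would score 0.0, since the raw predictions never match the gold references.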
```python
new_examples = []
for eg in examples:
    ents, _ = eg.get_aligned_ents_and_ner()
    new_eg = eg.copy()
```
Making a copy here feels safest? Not 100% about all the possible interactions with all other components in the pipeline, before or after, annotated or not, and frozen or not...
Hm, do we manipulate examples in other components? I'm also unsure about this. Either way 👍 for copying it.
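The shared-mutable-state concern behind the copy can be shown with a tiny stand-alone sketch. The `Example` class and its fields below are invented for illustration; spaCy's real `Example.copy()` serves the same purpose of isolating component-local mutation.

```python
import copy

# Toy Example holding a mutable "predicted" doc. Mutating it in place inside
# one component would be visible to every other component sharing the object.
class Example:
    def __init__(self, predicted, reference):
        self.predicted = predicted
        self.reference = reference

    def copy(self):
        # Deep-copy so downstream mutation can't leak back to the caller.
        return Example(copy.deepcopy(self.predicted),
                       copy.deepcopy(self.reference))

eg = Example(predicted={"ents": []}, reference={"ents": ["Emerson"]})

# Component-local augmentation: work on a copy, transfer gold ents onto it.
new_eg = eg.copy()
new_eg.predicted["ents"] = list(new_eg.reference["ents"])

print(eg.predicted["ents"])      # -> []  (original untouched)
print(new_eg.predicted["ents"])  # -> ['Emerson']
```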
Great spot 👀 Can you elaborate on the cleaning up/restoration from reason 1.? Not sure what you mean by that.
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
Description
The `use_gold_ents` flag was introduced to allow the `entity_linker` to train on gold entities, even if there's no (annotating) NER component in the pipeline. I think this behaviour was buggy for a few reasons:

1. In `initialize()`, NER "predictions" from `eg.reference` were added to `eg.predicted` for the first 10 examples, and never cleaned up/restored afterwards.
2. In `update()`, this transfer happened on all examples, but here the ents were "restored" before calling the loss function. In theory, this should have prevented the EL from learning anything at all, except that in the corresponding unit test, this bug was masked by bug 1, which resulted in a few spurious annotations on the first 10 documents.
3. `Language.evaluate()` calls `pipe()` on the predicted docs, which won't have entities if there is no (annotating) NER in the pipeline.

To test some of this behaviour, I used different configs with the EL Emerson example, cf. explosion/projects#207. The "EL only" config would produce all-zero lines with `master`:

Then it would produce actual loss scores after fixing 1 and 2:

And finally, after fixing 3, it would give actual scores:
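To make the failure mode of bug 2 concrete, here is a toy reconstruction in plain Python. All names are illustrative, not spaCy code: gold ents get transferred onto the predictions but are "restored" before the loss sees them, so the loss always runs against the original (empty) predictions and nothing can be learned.

```python
# Toy model of the buggy vs. fixed update() flow described above.
# Invented names; this is not spaCy's implementation.

class Example:
    def __init__(self, predicted_ents, reference_ents):
        self.predicted_ents = predicted_ents
        self.reference_ents = reference_ents

def missing(eg):
    # Stand-in "loss": number of gold ents absent from the predictions.
    return len(set(eg.reference_ents) - set(eg.predicted_ents))

def buggy_update(examples):
    losses = []
    for eg in examples:
        saved = eg.predicted_ents
        eg.predicted_ents = list(eg.reference_ents)  # transfer gold ents...
        eg.predicted_ents = saved  # ...but restore BEFORE the loss (the bug)
        losses.append(missing(eg))
    return losses

def fixed_update(examples):
    losses = []
    for eg in examples:
        saved = eg.predicted_ents
        eg.predicted_ents = list(eg.reference_ents)  # gold ents reach the loss
        losses.append(missing(eg))
        eg.predicted_ents = saved  # restore AFTER the loss
    return losses

examples = [Example([], ["Emerson"]), Example([], ["Obama"])]
print(buggy_update(examples))  # -> [1, 1]  (gold ents never reach the loss)
print(fixed_update(examples))  # -> [0, 0]  (loss computed on gold ents)
```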
Types of change
bug fixes & enhancement
Checklist