
Fix entity linker batching #9669

Merged · 30 commits · Mar 4, 2022

Conversation

@polm (Contributor) commented Nov 13, 2021

Description

The Entity Linker doesn't work with Listeners because it violates an (implicit?) requirement that the batch passed to the pipeline update must be the same batch passed to the model's forward with begin_update.

The Entity Linker does this because it builds sentence docs to get context around the relevant entities. It wasn't obviously a problem before because the EntityLinker predates the use of Listeners in spaCy.

This PR changes the structure of the EntityLinker so that the batch is handled consistently with other components and works with Listeners. The logic of building context moves inside the model.
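As an illustration of that restructuring, here is a minimal sketch (all names are illustrative, not the PR's actual architecture) of a thinc layer that receives the same Doc batch the pipeline saw and builds the per-entity sentence contexts internally:

```python
from thinc.api import Model

def build_span_context_getter() -> Model:
    """Sketch of a layer that pulls out one context span per entity."""

    def forward(model, docs, is_train):
        # Use the sentence around each entity as its context; this requires
        # sentence boundaries to be set on the Docs.
        spans = [ent.sent for doc in docs for ent in doc.ents]

        def backprop(d_spans):
            # Span extraction itself isn't differentiable, so there is
            # nothing meaningful to pass back to the Docs.
            return []

        return spans, backprop

    return Model("span_context_getter", forward)
```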

There are a few issues with this that are still unresolved. The main one is that the EntityLinker only makes a prediction if entities are available, which raises two questions.

  1. What happens when no entities are available? This can happen with unlucky batching if enough training docs don't have entities. With no entities some of the calculations just don't work out. Maybe we need to provide some filler data?
  2. How do we get the entities? The old code checked gold data for entities and used that to create candidates. In the model we don't have Examples, just Docs without gold data. So we need to either have the Entities set earlier in the pipeline, or copy them from gold docs. Which do we do, and do we make that a config option, or partly automate it, or something else?

With the first commit here, training an NER and NEL model together using listeners is possible using this user-provided test repo. But in order for this to work, entities are unconditionally copied over from the reference Docs when preparing the batch (see the sketch below), which may not be desirable when training together with NER, and is not really what we want for a final fix.
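A rough sketch of that unconditional copying, under the assumption that the gold entities are projected onto the predicted Docs via the existing alignment helpers (the function name and details are mine, not the PR's):

```python
from typing import List
from spacy.tokens import Doc
from spacy.training import Example

def copy_gold_ents_to_predicted(examples: List[Example]) -> List[Doc]:
    docs = []
    for eg in examples:
        doc = eg.predicted
        # Project the reference (gold) entities onto the predicted Doc's
        # tokenization; spans that don't align are silently dropped here.
        doc.ents = eg.get_aligned_spans_y2x(eg.reference.ents)
        docs.append(doc)
    return docs
```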

Types of change

Bug fix. Brought up in these issues: #9310, #9291.

#9575 seems to be a separate problem but it might make sense to fix it in this PR too.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the labels bug (Bugs and behaviour differing from documentation) and feat / nel (Feature: Named Entity linking) on Nov 13, 2021
@polm polm requested a review from svlandeg November 15, 2021 10:58
@svlandeg (Member) left a comment

Thanks for looking into this Paul!

About the listener architecture: I think your proposed changes make sense (I'll look into the details more once this is out of draft), though in general we should think about what should happen in these cases. As you point out, there is an implicit assumption here about how listeners should be used. We should either document that, or look into some kind of fall-back mechanism for when the listener's cache fails. We can probably defer that to a future discussion/PR though.

About the entities: conceptually, I think we should allow training the EL either on gold entities (as before) or on predicted ones. The latter is perfectly possible with the recent "annotating components" feature. The EL should then take a parameter that defines whether gold entities should be used (True by default) or not. In the latter case, the user needs to put the NER, or any other component that writes to doc.ents, in annotating_components. We can't really check this, though we can at least output a warning if annotating_components is empty.
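A hedged sketch of how such a switch might be used (the name use_gold_ents follows the discussion above; whether the final change uses this exact name and default is not settled here):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
# Train the entity linker on predicted rather than gold entities. The
# component that writes to doc.ents (here the NER) then has to be listed
# under annotating_components in the training config so its predictions
# are actually set on the docs during training.
nlp.add_pipe("entity_linker", config={"use_gold_ents": False})
```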

Review thread on spacy/ml/models/entity_linker.py (outdated, resolved)
@polm (Contributor, Author) commented Nov 16, 2021

> The EL should then take a parameter that defines whether gold entities should be used (True by default) or not.

I hadn't considered the possibility of adding a parameter, that sounds great. I'll work on doing that and I think with that change this can be out of draft.

For the broader issue, I think a doc note can be added on the current behavior pretty easily, and we can discuss whether the current requirements are desirable or not later.

@polm (Contributor, Author) commented Nov 20, 2021

The failing tests happen because, for tests that use the default get_candidates during initialization, there are no entities on the example docs. Because there are no entities, no candidates for annotations are produced, and initialization fails.

During initialization specifically, we could check whether there are no NER annotations and add some fake ones so the network can run, though I'm not sure that's the right approach.
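A small sketch of that idea, purely to give the network something to run on when the sample docs carry no entities (the helper and the PLACEHOLDER label are illustrative, not an adopted fix):

```python
from spacy.tokens import Span

def add_placeholder_ents(sample_docs):
    # If a doc has tokens but no entities, add a single fake entity so the
    # entity linker's network can be initialized and run.
    for doc in sample_docs:
        if len(doc) and not doc.ents:
            doc.ents = [Span(doc, 0, 1, label="PLACEHOLDER")]
    return sample_docs
```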

@polm (Contributor, Author) commented Dec 10, 2021

So the tests have been failing because the changes in architecture affect the loss calculation too.

In the old version of the component, the input to the loss function contained one Doc per gold entity with a KB entry. The loss function calculates an embedding for each gold entity and compares those.

The issue is that in the revised version, the model doesn't know what's gold and what isn't; it just uses all entities. This means it produces predictions for NIL entities, which the loss function doesn't calculate a value for. So that has to be resolved somehow.

I guess these are the approaches:

  1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
  2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
  3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)

I tried 2 first, but the loss was too high and the test consistently failed.

My most recent commit takes approach 1 and it passes - but only sometimes. Sometimes the loss is a little too high, and more rarely the loss goes to NaN. I'm not sure why it's not reproducible or why NaN would come up - seems there's some kind of overflow?

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/thinc/layers/layernorm.py:32: RuntimeWarning: overflow encountered in multiply
    d_xhat = N * dY - sum_dy - dist * var ** (-1.0) * sum_dy_dist

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/numpy/core/_methods.py:47: RuntimeWarning: invalid value encountered in reduce
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/numpy/core/fromnumeric.py:87: RuntimeWarning: invalid value encountered in reduce
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/thinc/backends/ops.py:755: RuntimeWarning: invalid value encountered in multiply
    gradient *= threshold / grad_norm

-- Docs: https://docs.pytest.org/en/stable/warnings.html

I'm not sure about approach 3; I need to think about it more. I guess the model could use the get_candidates function to exclude entities from predictions, so maybe that's the right thing to do.
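For reference, a minimal sketch of approach 1 above: derive the loss only from entities that have a non-NIL gold KB id and back-propagate zeros for the rest. Shapes and names are assumptions; entity_encodings is taken to be row-aligned with sentence_encodings, with placeholder rows for the NIL entities that the mask excludes.

```python
import numpy

def masked_squared_loss(sentence_encodings, entity_encodings, keep_mask):
    # keep_mask is a boolean array: True for entities with a gold KB id.
    d_scores = numpy.zeros_like(sentence_encodings)
    n_kept = max(int(keep_mask.sum()), 1)
    diff = sentence_encodings[keep_mask] - entity_encodings[keep_mask]
    d_scores[keep_mask] = diff / n_kept
    loss = float((diff ** 2).sum()) / n_kept
    return loss, d_scores
```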

@polm (Contributor, Author) commented Dec 10, 2021

Thinking about approach 3 some more, I don't think it'll work. The model doesn't know if an entity has a KB annotation or not (because that would imply cheating). It also doesn't currently have access to the get_candidates function, so it can't check that (the function is only in the component).

What we could do is have the component "erase" entities that don't have KB annotations, or that would get NIL annotations... but I'm not sure how that would interact with other pipeline components. I guess we could restore the entity annotations afterwards, but that gets messy.

I think there's nothing wrong in principle with the approach I'm taking now, though I'm not sure what's up with the math - even if it's wrong, I'd expect it to be consistent on CPU.

@svlandeg svlandeg self-requested a review December 27, 2021 13:11
@svlandeg (Member) left a comment
Thanks for the work on this Paul. We need to dig into the NaN issue - I left some review comments. I think the key is in the missing backprop.

In general I'd like to see more unit tests here: test edge cases with no entities (predicted or gold), test the different use_gold_ents scenarios, etc. It would also be good to have a specific test for both the legacy v1 and the new v2 architecture, to ensure both get initialized properly, can train for one or two steps, and can predict. The test can be written by loading an nlp from a minimal config that is filled with defaults depending on the architecture.

Some additional comments to your questions:

> What happens when no entities are available? This can happen with unlucky batching if enough training docs don't have entities. With no entities some of the calculations just don't work out. Maybe we need to provide some filler data?

Is there a way to spot this very early on, like when we have empty docs, and prevent the neural network from predicting at all?

> I guess these are the approaches:
>
>   1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
>   2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
>   3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)

The problem is that the EL model doesn't try to learn the ID for a given entity; it tries to get the embedding of the sentence as close as it can to the entity embedding. There can be no universal NIL embedding - that would mess up the normal embeddings too much. So 2. won't work.

I think 1. should work though: similar to how we would mask missing information when calculating textcat loss, we should mask the NIL entity and not derive loss values from it.

Review thread on spacy/ml/models/entity_linker.py (resolved)
Review thread on spacy/pipeline/entity_linker.py (outdated, resolved)
Review thread on spacy/pipeline/entity_linker.py (resolved)
@polm (Contributor, Author) commented Jan 16, 2022

I think this should go green this time.

The issue was with the gradient for entities not in the KB. A gradient can't actually be calculated for them, and the model was only returning gradients it could calculate. But because the model doesn't have information about which entities are in the KB, that means there could be a mismatch between the number of calculated embeddings and the size of the gradient.

A mismatch like that should probably cause an error, but it seems it may have been accessing out-of-bounds memory or doing something else weird. So that's where the NaNs came from, and that's another problem that needs a follow-up.
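A sketch of the fix described here (names are illustrative): keep the gradient row-aligned with the model's predictions by scattering the gradients that could be computed into a zero array of the full prediction size.

```python
import numpy

def expand_gradient(d_known, known_indices, n_predictions):
    # d_known: gradients for the entities we could compute embeddings for;
    # known_indices: their positions among all predictions the model made.
    d_full = numpy.zeros((n_predictions, d_known.shape[1]), dtype=d_known.dtype)
    d_full[known_indices] = d_known
    return d_full
```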

polm added a commit to polm/thinc that referenced this pull request Jan 16, 2022
Behavior of this with numpyops seems wrong, as instead of giving an
error it produces nans, as though it's accessing out of bounds memory or
something. See explosion/spaCy#9669.
@polm (Contributor, Author) commented Jan 16, 2022

Test failure is about garbage collection in textcat...? Might need to update from master, will look more at this later.

@svlandeg (Member) commented:
That failing spancat test is really weird. I see that there are some commits added to this branch that aren't supposed to be there. Maybe rebase to the current master?

@polm (Contributor, Author) commented Jan 17, 2022

Test still fails here, but it passes locally for me. Not sure what's up... (Also you're right it's spancat, I mistakenly said textcat before.)

EDIT: Ah nevermind, fails locally too. Hm...

@polm (Contributor, Author) commented Jan 17, 2022

Looking at the failing test in detail, it looks like it is flaky in master - I ran it 20 times on master and it failed 4 times. Not really sure what's going on here...

@svlandeg (Member) commented Jan 17, 2022

That's really weird. Could you create a separate PR, disable the flaky test, and create a follow-up ticket on our internal board to look into this in the near future? Then we can continue with this PR after the test has been disabled on master.

@adrianeboyd (Contributor) commented Jan 17, 2022

This looks bizarre. My bet is the new version of numpy (1.22.0), but we should figure out why.

Edited: No, not numpy (which would have been weird, to be honest). It's not really a good example.

I thought there was a brief train step in there, but it looks like this hasn't changed at all recently. Something in thinc?

@polm polm marked this pull request as ready for review January 23, 2022 07:05
@svlandeg (Member) left a comment

I added a failing test for the v1 legacy architecture. I tested this locally with the corresponding legacy branch, but it fails: EntityLinker_v1 is never returned, because model.name isn't matched.

Similarly, when running the NEL emerson project, the v2 architecture is returned. It looks like the new v2 code just runs in that case, probably because the data is extremely simple. But it seems to run more by coincidence because the legacy fallback isn't actually used.

Comment on lines 96 to 97
# Handle legacy model
if model.name == "spacy.EntityLinker.v1":
@svlandeg (Member):

How did you verify that this works? I don't really see how it could, the model.name is never set to that specific string as far as I can see.

@polm (Contributor, Author):

I think I completely overlooked this - it's clear looking at it now it won't work at all, since model.name is the automatically constructed Thinc name.

However, looking over this, I'm not sure the architecture name is available at all at this point. It's not in the model, it's not in the arguments here, and nlp.config doesn't seem to contain the config for the current component (and peeking at that would be nasty anyway).

I need to think about this more, but can we even check the architecture here?

@polm (Contributor, Author):

My solution for the issue here is to look in the thinc model name for the name of the new custom component that pulls out the spans. If it's not there, I assume it's the old model.

That feels awfully messy, but I think it's the cleanest way to check things here, since the model name isn't actually available.
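In code, the check described above could look roughly like this; the substring "span_maker" is an assumption for illustration, standing in for whatever the new span-pulling sublayer is called, since a thinc model's name is composed from its sublayers' names:

```python
from thinc.api import Model

def is_legacy_entity_linker_model(model: Model) -> bool:
    # The v2 architecture contains the new span-pulling layer, so its name
    # shows up in the composed thinc model name; if it's absent, assume v1.
    return "span_maker" not in model.name
```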

@svlandeg (Member) commented Feb 17, 2022

It would work for now, but the solution does feel a bit brittle and implicit, making it prone to bugs in the future.

I'm almost tempted to define a new attr that would record the factory, maybe something like

model.attrs["spacy_factory"] = {"EntityLinker": 2}

And then any internal check could verify the version, and if the right attribute is not available, assume it's below whatever you're checking, because it's a legacy one. In time I think this might be the most robust/clear solution, but it would require maintenance and making sure things stay consistent whenever we introduce a new model version...
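A sketch of that attrs idea (the key name follows the suggestion above; the helpers and the "missing attribute means legacy" default are mine):

```python
from thinc.api import Model

def tag_entity_linker_version(model: Model, version: int = 2) -> Model:
    # Record the factory/architecture version on the model at build time.
    model.attrs["spacy_factory"] = {"EntityLinker": version}
    return model

def entity_linker_model_version(model: Model) -> int:
    # A missing attribute implies a legacy model built before the attr existed.
    return model.attrs.get("spacy_factory", {}).get("EntityLinker", 1)
```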

@polm (Contributor, Author) commented Feb 14, 2022

Still working on this - in order to get use_gold_ents = False working I had to redo the code that aligned predictions and gold KB embeddings, since it relied on the assumption that there was exactly one prediction per gold entity. I figured it out but need to clean it up and write tests for situations where the alignment actually comes up.

@polm (Contributor, Author) commented Feb 16, 2022

So this works, and I have tests for it, but the code is kind of a mess.

The issue is that I need aligned entities, but I'm not sure any of the existing alignment functions do what I need. The model produces an Entity Linker prediction for each entity on the predicted document, and since those predictions are just a list of predicted embeddings, filtering them requires not only the aligned entities but also their indices relative to the predicted embeddings.

I think I can make this cleaner by getting the aligned entities before the forward pass, setting them on the predicted doc temporarily, and then restoring the predicted entities, but I need to check it. If that works it'll have the nice side effect of not doing calculations in the forward pass for instances that can't be backpropagated.
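A sketch of that set-and-restore idea (the helper name is mine; as the follow-up commits note, which entities get set still needs adjusting, but the mechanism is the one described here):

```python
def forward_with_aligned_ents(examples, forward):
    # Temporarily put the aligned gold entities on the predicted docs for the
    # forward pass, then restore whatever entities the predicted docs had.
    saved = [eg.predicted.ents for eg in examples]
    try:
        for eg in examples:
            eg.predicted.ents = eg.get_aligned_spans_y2x(eg.reference.ents)
        return forward([eg.predicted for eg in examples])
    finally:
        for eg, ents in zip(examples, saved):
            eg.predicted.ents = ents
```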

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
@polm (Contributor, Author) commented Feb 18, 2022

OK, I have added a get_matching_ents function that returns only the entities that are aligned between the predicted and reference docs. That's the set of entities we can backprop for. Now I have some design questions about this.

  1. Is get_matching_ents a reasonable addition to the Example? Is there a better name for it? Not sure how to differentiate it from "aligned" ents more clearly...
  2. When getting matching ents in the specific case of the EntityLinker, should it be necessary that labels of ents match as well as boundaries? My instinct is "no", but I don't have a strong justification for that. We could make it a parameter for the user but I feel like it's not a very interesting one.
  3. There is a situation with the internal state in the EntityLinker that I don't know how to test for. If alignment is done incorrectly but the number of entities matches between predicted and gold docs, then the math will work and training can run but the results will be meaningless.
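As a reference point for question 1, here is my reading of what get_matching_ents could return (a sketch, not the PR's implementation): the predicted entities whose character boundaries also occur as gold entities, i.e. the intersection we can backprop for, with labels deliberately ignored per question 2.

```python
from spacy.training import Example

def get_matching_ents(example: Example):
    # Both docs share the same text, so character offsets are comparable.
    gold_offsets = {(ent.start_char, ent.end_char) for ent in example.reference.ents}
    return [
        ent
        for ent in example.predicted.ents
        if (ent.start_char, ent.end_char) in gold_offsets
    ]
```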

@polm (Contributor, Author) commented Feb 18, 2022

Tests are now failing because EntityLinker.v1 isn't available in the registry; the PR adding it to spacy-legacy needs to be coordinated with this one.

I'm not really sure how to handle that... We could do a dev release of spacy-legacy and have this PR branch use it, but that seems like too many moving parts, and it'll have to be ripped out anyway. I could put in a placeholder, but then the test isn't meaningful. I guess we could also just xfail it until spacy-legacy is updated...?

@svlandeg (Member) commented Feb 21, 2022

> I'm not really sure how to handle that... We could do a dev release of spacy-legacy and have this PR branch use it, but that seems like too many moving parts, and it'll have to be ripped out anyway.

Why should we have to rip it out again? IIRC there's no issue when a function is declared in both spacy and spacy-legacy - the former should take precedence. And either way they should be the same thing, right?

Actually, you can go ahead and test locally whether this gives any issues when you have explosion/spacy-legacy#18 installed next to spaCy's master branch, and then try running some code with v1. Should hopefully be fine 🤞

@polm (Contributor, Author) commented Feb 22, 2022

I have been testing it locally - with the patched version of spacy-legacy it works fine.

I thought that maybe using a special release of spacy-legacy in this PR branch would be too complicated, but on reflection I guess there's no issue with it - I may have misunderstood how resolution between spacy and spacy-legacy registries worked.

If it's not too complicated we can do the test release and update the dependencies in this branch.

@svlandeg (Member) commented:
Sorry, I should have been clearer. Ideally, we can just go ahead with a proper release of spacy-legacy including the two open PRs there at the moment. This shouldn't interfere with the current, normal master branch in spacy. Then that release is just a normal one, not a "test" one, and won't need to be pulled (because I agree that would be a hassle).

svlandeg added a commit to explosion/thinc that referenced this pull request Mar 3, 2022
* Add assert that size hasn't changed for reduce mean backprop

Behavior of this with numpyops seems wrong, as instead of giving an
error it produces nans, as though it's accessing out of bounds memory or
something. See explosion/spaCy#9669.

* Add clean checks for all reduce functions

This also saves the size in a variable so the actual data can be garbage
collected.

* Add minimal test

There are no tests for most of the reduce layers...

* Add size consistency check to softmax

I think the math won't work out anyway if this is inconsistent, but
checking here makes the cause of errors more explicit.

* Use a decorator for consistent backprop

This is cleaner than copying the other code. Maybe this decorator could
go on the top of a layer too to be even cleaner?

* Make backprop decorator work on forward

This simplifies things a bit at the point of usage.

* Add tests for all reduce methods

* Update thinc/util.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* New approach to consistency checking

Currently only in reduce_mean while I get feedback

* Restructure arrayinfo creation

* Formatting

* Replace decorator with ArrayInfo everywhere

* Update thinc/util.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename ainfo var

* Relax typing

* Change dtype to not pick up name

* Remove leftover reference

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
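To make the thinc change above concrete, here is an illustrative sketch of the idea (the names mirror the commit messages, but this is not thinc's actual API): record the shape and dtype of the forward output, and assert that the gradient handed to backprop matches, so a size mismatch raises instead of silently producing NaNs.

```python
from dataclasses import dataclass
import numpy

@dataclass
class ArrayInfo:
    shape: tuple
    dtype: numpy.dtype

    @classmethod
    def from_array(cls, arr):
        return cls(shape=arr.shape, dtype=arr.dtype)

    def check_consistency(self, arr):
        assert arr.shape == self.shape, f"expected {self.shape}, got {arr.shape}"
        assert arr.dtype == self.dtype, f"expected {self.dtype}, got {arr.dtype}"

def reduce_mean(X: numpy.ndarray, lengths: numpy.ndarray):
    # X has one row per token; produce one mean row per sequence (lengths > 0).
    starts = numpy.cumsum(lengths)[:-1]
    Y = numpy.stack([seq.mean(axis=0) for seq in numpy.split(X, starts)])
    ainfo = ArrayInfo.from_array(Y)

    def backprop(dY: numpy.ndarray) -> numpy.ndarray:
        # Raise on an inconsistent gradient instead of reading bogus memory.
        ainfo.check_consistency(dY)
        return numpy.repeat(dY / lengths[:, None], lengths, axis=0)

    return Y, backprop
```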
@svlandeg (Member) left a comment

This turned out to be a little more involved than we'd expected at the beginning, but with the new release of spacy-legacy, this is now finally good to merge 🎉

@svlandeg (Member) commented Mar 4, 2022

Nice work @polm !

@svlandeg merged commit 91acc3e into explosion:master on Mar 4, 2022
@polm mentioned this pull request on Mar 7, 2022
mrriteshranjan added a commit to mrriteshranjan/spaCy that referenced this pull request Mar 11, 2022
@polm mentioned this pull request on Dec 5, 2022