
Fix entity linker batching #9669

Merged · 30 commits · Mar 4, 2022

Conversation

@polm (Contributor) commented Nov 13, 2021

Description

The Entity Linker doesn't work with Listeners because it violates an (implicit?) requirement that the batch passed to the pipeline update must be the same batch passed to the model's forward with begin_update.

The Entity Linker does this because it builds sentence docs to get context around the relevant entities. It wasn't obviously a problem before because the EntityLinker predates the use of Listeners in spaCy.

This PR changes the structure of the EntityLinker so that the batch is handled consistently with other components and works with Listeners. The logic of building context moves inside the model.
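As an illustration of that restructuring, here is a minimal sketch (all names are illustrative, not the PR's actual architecture) of a thinc layer that receives the same Doc batch the pipeline saw and builds the per-entity sentence contexts internally:

```python
from thinc.api import Model

def build_span_context_getter() -> Model:
    """Sketch of a layer that pulls out one context span per entity."""

    def forward(model, docs, is_train):
        # Use the sentence around each entity as its context; this requires
        # sentence boundaries to be set on the Docs.
        spans = [ent.sent for doc in docs for ent in doc.ents]

        def backprop(d_spans):
            # Span extraction itself isn't differentiable, so there is
            # nothing meaningful to pass back to the Docs.
            return []

        return spans, backprop

    return Model("span_context_getter", forward)
```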

There are a few issues with this that are still unresolved. The main one is that the EntityLinker only makes a prediction if entities are available, which raises two questions.

  1. What happens when no entities are available? This can happen with unlucky batching if enough training docs don't have entities. With no entities some of the calculations just don't work out. Maybe we need to provide some filler data?
  2. How do we get the entities? The old code checked gold data for entities and used that to create candidates. In the model we don't have Examples, just Docs without gold data. So we need to either have the Entities set earlier in the pipeline, or copy them from gold docs. Which do we do, and do we make that a config option, or partly automate it, or something else?

With the first commit here, training an NER and NEL model together using listeners is possible using this user-provided test repo. But in order for this to work, entities are unconditionally copied over from the reference Docs when preparing the batch (see the sketch below), which may not be desirable when training together with NER, and is not really what we want for a final fix.
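A rough sketch of that unconditional copying, under the assumption that the gold entities are projected onto the predicted Docs via the existing alignment helpers (the function name and details are mine, not the PR's):

```python
from typing import List
from spacy.tokens import Doc
from spacy.training import Example

def copy_gold_ents_to_predicted(examples: List[Example]) -> List[Doc]:
    docs = []
    for eg in examples:
        doc = eg.predicted
        # Project the reference (gold) entities onto the predicted Doc's
        # tokenization; spans that don't align are silently dropped here.
        doc.ents = eg.get_aligned_spans_y2x(eg.reference.ents)
        docs.append(doc)
    return docs
```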

Types of change

Bug fix. Brought up in these issues: #9310, #9291.

#9575 seems to be a separate problem but it might make sense to fix it in this PR too.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the labels bug (Bugs and behaviour differing from documentation) and feat / nel (Feature: Named Entity linking) on Nov 13, 2021
@polm polm requested a review from svlandeg November 15, 2021 10:58
@svlandeg (Member) left a comment

Thanks for looking into this Paul!

About the listener architecture: I think your proposed changes make sense (I'll look into the details more once this is out of draft), though in general we should think about what should happen in these cases. As you point out, there is an implicit assumption here about how listeners should be used. We should either document that, or look into some kind of fall-back mechanism for when the listener's cache fails. We can probably defer that to a future discussion/PR though.

About the entities: conceptually, I think we should allow training the EL either on gold entities (as before) or on predicted ones. The latter is perfectly possible with the recent "annotating components" feature. The EL should then take a parameter that defines whether gold entities should be used (True by default) or not. In the latter case, the user needs to put the NER, or any other component that writes to doc.ents, in annotating_components. We can't really check this, though we can at least output a warning if annotating_components is empty.
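A hedged sketch of how such a switch might be used (the name use_gold_ents follows the discussion above; whether the final change uses this exact name and default is not settled here):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
# Train the entity linker on predicted rather than gold entities. The
# component that writes to doc.ents (here the NER) then has to be listed
# under annotating_components in the training config so its predictions
# are actually set on the docs during training.
nlp.add_pipe("entity_linker", config={"use_gold_ents": False})
```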

Review thread on spacy/ml/models/entity_linker.py (outdated, resolved)
@polm (Contributor, Author) commented Nov 16, 2021

> The EL should then take a parameter that defines whether gold entities should be used (True by default) or not.

I hadn't considered the possibility of adding a parameter, that sounds great. I'll work on doing that and I think with that change this can be out of draft.

For the broader issue, I think a doc note can be added on the current behavior pretty easily, and we can discuss whether the current requirements are desirable or not later.

@polm (Contributor, Author) commented Nov 20, 2021

The failing tests happen because, for tests that use the default get_candidates during initialization, there are no entities on the example docs. Because there are no entities, no candidates for annotations are produced, and initialization fails.

During initialization specifically, we could check whether there are no NER annotations and add some fake ones so the network can run, though I'm not sure that's the right approach.
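A small sketch of that idea, purely to give the network something to run on when the sample docs carry no entities (the helper and the PLACEHOLDER label are illustrative, not an adopted fix):

```python
from spacy.tokens import Span

def add_placeholder_ents(sample_docs):
    # If a doc has tokens but no entities, add a single fake entity so the
    # entity linker's network can be initialized and run.
    for doc in sample_docs:
        if len(doc) and not doc.ents:
            doc.ents = [Span(doc, 0, 1, label="PLACEHOLDER")]
    return sample_docs
```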

@polm (Contributor, Author) commented Dec 10, 2021

So the tests have been failing because the changes in architecture affect the loss calculation too.

In the old version of the component, the input to the loss function contained one Doc per gold entity with a KB entry. The loss function calculates an embedding for each gold entity and compares those.

The issue is that in the revised version, the model doesn't know what's gold and what isn't; it just uses all entities. This means it produces predictions for NIL entities, which the loss function doesn't calculate a value for. So that has to be resolved somehow.

I guess these are the approaches:

  1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
  2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
  3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)

I tried 2 first, but the loss was too high and the test consistently failed.

My most recent commit takes approach 1 and it passes - but only sometimes. Sometimes the loss is a little too high, and more rarely the loss goes to NaN. I'm not sure why it's not reproducible or why NaN would come up - seems there's some kind of overflow?

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/thinc/layers/layernorm.py:32: RuntimeWarning: overflow encountered in multiply
    d_xhat = N * dY - sum_dy - dist * var ** (-1.0) * sum_dy_dist

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/numpy/core/_methods.py:47: RuntimeWarning: invalid value encountered in reduce
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/numpy/core/fromnumeric.py:87: RuntimeWarning: invalid value encountered in reduce
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

spacy/tests/pipeline/test_entity_linker.py::test_overfitting_IO
  /mnt/pool/code/spacy/env/lib/python3.9/site-packages/thinc/backends/ops.py:755: RuntimeWarning: invalid value encountered in multiply
    gradient *= threshold / grad_norm

-- Docs: https://docs.pytest.org/en/stable/warnings.html

I'm not sure about approach 3; I need to think about it more. I guess the model could use the get_candidates function to exclude entities from predictions, so maybe that's the right thing to do.
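For reference, a minimal sketch of approach 1 above: derive the loss only from entities that have a non-NIL gold KB id and back-propagate zeros for the rest. Shapes and names are assumptions; entity_encodings is taken to be row-aligned with sentence_encodings, with placeholder rows for the NIL entities that the mask excludes.

```python
import numpy

def masked_squared_loss(sentence_encodings, entity_encodings, keep_mask):
    # keep_mask is a boolean array: True for entities with a gold KB id.
    d_scores = numpy.zeros_like(sentence_encodings)
    n_kept = max(int(keep_mask.sum()), 1)
    diff = sentence_encodings[keep_mask] - entity_encodings[keep_mask]
    d_scores[keep_mask] = diff / n_kept
    loss = float((diff ** 2).sum()) / n_kept
    return loss, d_scores
```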

@polm (Contributor, Author) commented Dec 10, 2021

Thinking about approach 3 some more, I don't think it'll work. The model doesn't know if an entity has a KB annotation or not (because that would imply cheating). It also doesn't currently have access to the get_candidates function, so it can't check that (the function is only in the component).

What we could do is have the component "erase" entities that don't have KB annotations, or that would get NIL annotations... but I'm not sure how that would interact with other pipeline components. I guess we could restore the entity annotations afterwards, but that gets messy.

I think there's nothing wrong in principle with the approach I'm taking now, though I'm not sure what's up with the math - even if it's wrong, I'd expect it to be consistent on CPU.

@svlandeg svlandeg self-requested a review December 27, 2021 13:11
@svlandeg (Member) left a comment
Thanks for the work on this Paul. We need to dig into the NaN issue - I left some review comments. I think the key is in the missing backprop.

In general I'd like to see more unit tests here: test edge cases with no entities (predicted or gold), test the different use_gold_ents scenarios, etc. It would also be good to have a specific test for both the legacy v1 and the new v2 architecture, to ensure both get initialized properly, can train for one or two steps, and can predict. The test can be written by loading an nlp from a minimal config that is filled with defaults depending on the architecture.

Some additional comments to your questions:

> What happens when no entities are available? This can happen with unlucky batching if enough training docs don't have entities. With no entities some of the calculations just don't work out. Maybe we need to provide some filler data?

Is there a way to spot this very early on, like when we have empty docs, and prevent the neural network from predicting at all?

> I guess these are the approaches:
>
>   1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
>   2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
>   3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)

The problem is that the EL model doesn't try to learn the ID for a given entity; it tries to get the embedding of the sentence as close as it can to the entity embedding. There can be no universal NIL embedding - that would mess up the normal embeddings too much. So 2. won't work.

I think 1. should work though: similar to how we would mask missing information when calculating textcat loss, we should mask the NIL entity and not derive loss values from it.

Review thread on spacy/ml/models/entity_linker.py (resolved)
Review thread on spacy/pipeline/entity_linker.py (outdated, resolved)
Review thread on spacy/pipeline/entity_linker.py (resolved)
@polm (Contributor, Author) commented Jan 16, 2022

I think this should go green this time.

The issue was with the gradient for entities not in the KB. A gradient can't actually be calculated for them, and the model was only returning gradients it could calculate. But because the model doesn't have information about which entities are in the KB, that means there could be a mismatch between the number of calculated embeddings and the size of the gradient.

A mismatch like that should probably cause an error, but it seems it may have been accessing out-of-bounds memory or doing something else weird. So that's where the NaNs came from, and that's another problem that needs a follow-up.
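A sketch of the fix described here (names are illustrative): keep the gradient row-aligned with the model's predictions by scattering the gradients that could be computed into a zero array of the full prediction size.

```python
import numpy

def expand_gradient(d_known, known_indices, n_predictions):
    # d_known: gradients for the entities we could compute embeddings for;
    # known_indices: their positions among all predictions the model made.
    d_full = numpy.zeros((n_predictions, d_known.shape[1]), dtype=d_known.dtype)
    d_full[known_indices] = d_known
    return d_full
```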

polm added a commit to polm/thinc that referenced this pull request Jan 16, 2022
Behavior of this with numpyops seems wrong, as instead of giving an
error it produces nans, as though it's accessing out of bounds memory or
something. See explosion/spaCy#9669.
@polm (Contributor, Author) commented Jan 16, 2022

Test failure is about garbage collection in textcat...? Might need to update from master, will look more at this later.

@svlandeg (Member) commented:
That failing spancat test is really weird. I see that there are some commits added to this branch that aren't supposed to be there. Maybe rebase to the current master?

@polm (Contributor, Author) commented Jan 17, 2022

Test still fails here, but it passes locally for me. Not sure what's up... (Also you're right it's spancat, I mistakenly said textcat before.)

EDIT: Ah nevermind, fails locally too. Hm...

@polm (Contributor, Author) commented Jan 17, 2022

Looking at the failing test in detail, it looks like it is flaky in master - I ran it 20 times on master and it failed 4 times. Not really sure what's going on here...

@svlandeg (Member) commented Jan 17, 2022

That's really weird. Could you create a separate PR, disable the flaky test, and create a follow-up ticket on our internal board to look into this in the near future? Then we can continue with this PR after the test has been disabled on master.

@adrianeboyd (Contributor) commented Jan 17, 2022

This looks bizarre. My bet is the new version of numpy (1.22.0), but we should figure out why.

Edited: No, not numpy (which would have been weird, to be honest). It's not really a good example.

I thought there was a brief train step in there, but it looks like this hasn't changed at all recently. Something in thinc?

@polm polm marked this pull request as ready for review January 23, 2022 07:05
@svlandeg (Member) left a comment

I added a failing test for the v1 legacy architecture. I tested this locally with the corresponding legacy branch, but it fails: EntityLinker_v1 is never returned, because model.name isn't matched.

Similarly, when running the NEL emerson project, the v2 architecture is returned. It looks like the new v2 code just runs in that case, probably because the data is extremely simple. But it seems to run more by coincidence because the legacy fallback isn't actually used.

Comment on lines 96 to 97
# Handle legacy model
if model.name == "spacy.EntityLinker.v1":
@svlandeg (Member):

How did you verify that this works? I don't really see how it could, the model.name is never set to that specific string as far as I can see.

@polm (Contributor, Author):

I think I completely overlooked this - it's clear looking at it now it won't work at all, since model.name is the automatically constructed Thinc name.

However, looking over this, I'm not sure the architecture name is available at all at this point. It's not in the model, it's not in the arguments here, and nlp.config doesn't seem to contain the config for the current component (and peeking at that would be nasty anyway).

I need to think about this more, but can we even check the architecture here?

@polm (Contributor, Author):

My solution for the issue here is to look in the thinc model name for the name of the new custom component that pulls out the spans. If it's not there, I assume it's the old model.

That feels awfully messy, but I think it's the cleanest way to check things here, since the model name isn't actually available.
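In code, the check described above could look roughly like this; the substring "span_maker" is an assumption for illustration, standing in for whatever the new span-pulling sublayer is called, since a thinc model's name is composed from its sublayers' names:

```python
from thinc.api import Model

def is_legacy_entity_linker_model(model: Model) -> bool:
    # The v2 architecture contains the new span-pulling layer, so its name
    # shows up in the composed thinc model name; if it's absent, assume v1.
    return "span_maker" not in model.name
```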

@svlandeg (Member) commented Feb 17, 2022

It would work for now, but the solution does feel a bit brittle and implicit, making it prone to bugs in the future.

I'm almost tempted to define a new attr that would record the factory, maybe something like

model.attrs["spacy_factory"] = {"EntityLinker": 2}

And then any internal check could verify the version, and if the right attribute is not available, assume it's below whatever you're checking, because it's a legacy one. In time I think this might be the most robust/clear solution, but it would require maintenance and making sure things stay consistent whenever we introduce a new model version...
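A sketch of that attrs idea (the key name follows the suggestion above; the helpers and the "missing attribute means legacy" default are mine):

```python
from thinc.api import Model

def tag_entity_linker_version(model: Model, version: int = 2) -> Model:
    # Record the factory/architecture version on the model at build time.
    model.attrs["spacy_factory"] = {"EntityLinker": version}
    return model

def entity_linker_model_version(model: Model) -> int:
    # A missing attribute implies a legacy model built before the attr existed.
    return model.attrs.get("spacy_factory", {}).get("EntityLinker", 1)
```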

@polm (Contributor, Author) commented Feb 14, 2022

Still working on this - in order to get use_gold_ents = False working I had to redo the code that aligned predictions and gold KB embeddings, since it relied on the assumption that there was exactly one prediction per gold entity. I figured it out but need to clean it up and write tests for situations where the alignment actually comes up.

@polm (Contributor, Author) commented Feb 16, 2022

So this works, and I have tests for it, but the code is kind of a mess.

The issue is that I need aligned entities, but I'm not sure any of the existing alignment functions do what I need. The model produces an Entity Linker prediction for each entity on the predicted document, and since those predictions are just a list of predicted embeddings, filtering them requires not only the aligned entities but also their indices relative to the predicted embeddings.

I think I can make this cleaner by getting the aligned entities before the forward pass, setting them on the predicted doc temporarily, and then restoring the predicted entities, but I need to check it. If that works it'll have the nice side effect of not doing calculations in the forward pass for instances that can't be backpropagated.
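A sketch of that set-and-restore idea (the helper name is mine; as the follow-up commits note, which entities get set still needs adjusting, but the mechanism is the one described here):

```python
def forward_with_aligned_ents(examples, forward):
    # Temporarily put the aligned gold entities on the predicted docs for the
    # forward pass, then restore whatever entities the predicted docs had.
    saved = [eg.predicted.ents for eg in examples]
    try:
        for eg in examples:
            eg.predicted.ents = eg.get_aligned_spans_y2x(eg.reference.ents)
        return forward([eg.predicted for eg in examples])
    finally:
        for eg, ents in zip(examples, saved):
            eg.predicted.ents = ents
```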

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
@polm (Contributor, Author) commented Feb 18, 2022

OK, I have added a get_matching_ents function that returns only the entities that are aligned between the predicted and reference docs. That's the set of entities we can backprop for. Now I have some design questions about this.

  1. Is get_matching_ents a reasonable addition to the Example? Is there a better name for it? Not sure how to differentiate it from "aligned" ents more clearly...
  2. When getting matching ents in the specific case of the EntityLinker, should it be necessary that labels of ents match as well as boundaries? My instinct is "no", but I don't have a strong justification for that. We could make it a parameter for the user but I feel like it's not a very interesting one.
  3. There is a situation with the internal state in the EntityLinker that I don't know how to test for. If alignment is done incorrectly but the number of entities matches between predicted and gold docs, then the math will work and training can run but the results will be meaningless.
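As a reference point for question 1, here is my reading of what get_matching_ents could return (a sketch, not the PR's implementation): the predicted entities whose character boundaries also occur as gold entities, i.e. the intersection we can backprop for, with labels deliberately ignored per question 2.

```python
from spacy.training import Example

def get_matching_ents(example: Example):
    # Both docs share the same text, so character offsets are comparable.
    gold_offsets = {(ent.start_char, ent.end_char) for ent in example.reference.ents}
    return [
        ent
        for ent in example.predicted.ents
        if (ent.start_char, ent.end_char) in gold_offsets
    ]
```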

@polm (Contributor, Author) commented Feb 18, 2022

Tests are now failing because EntityLinker.v1 isn't available in the registry; the PR adding it to spacy-legacy needs to be coordinated with this one.

I'm not really sure how to handle that... We could do a dev release of spacy-legacy and have this PR branch use it, but that seems like too many moving parts, and it'll have to be ripped out anyway. I could put in a placeholder, but then the test isn't meaningful. I guess we could also just xfail it until spacy-legacy is updated...?

@svlandeg (Member) commented Feb 21, 2022

> I'm not really sure how to handle that... We could do a dev release of spacy-legacy and have this PR branch use it, but that seems like too many moving parts, and it'll have to be ripped out anyway.

Why should we have to rip it out again? IIRC there's no issue when a function is declared in both spacy and spacy-legacy - the former should take precedence. And either way they should be the same thing, right?

Actually, you can go ahead and test locally whether this gives any issues when you have explosion/spacy-legacy#18 installed next to spaCy's master branch, and then try running some code with v1. Should hopefully be fine 🤞

@polm (Contributor, Author) commented Feb 22, 2022

I have been testing it locally - with the patched version of spacy-legacy it works fine.

I thought that maybe using a special release of spacy-legacy in this PR branch would be too complicated, but on reflection I guess there's no issue with it - I may have misunderstood how resolution between spacy and spacy-legacy registries worked.

If it's not too complicated we can do the test release and update the dependencies in this branch.

@svlandeg (Member) commented:
Sorry, I should have been clearer. Ideally, we can just go ahead with a proper release of spacy-legacy including the two open PRs there at the moment. This shouldn't interfere with the current, normal master branch in spacy. Then that release is just a normal one, not a "test" one, and won't need to be pulled (because I agree that would be a hassle).

svlandeg added a commit to explosion/thinc that referenced this pull request Mar 3, 2022
* Add assert that size hasn't changed for reduce mean backprop

Behavior of this with numpyops seems wrong, as instead of giving an
error it produces nans, as though it's accessing out of bounds memory or
something. See explosion/spaCy#9669.

* Add clean checks for all reduce functions

This also saves the size in a variable so the actual data can be garbage
collected.

* Add minimal test

There are no tests for most of the reduce layers...

* Add size consistency check to softmax

I think the math won't work out anyway if this is inconsistent, but
checking here makes the cause of errors more explicit.

* Use a decorator for consistent backprop

This is cleaner than copying the other code. Maybe this decorator could
go on the top of a layer too to be even cleaner?

* Make backprop decorator work on forward

This simplifies things a bit at the point of usage.

* Add tests for all reduce methods

* Update thinc/util.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* New approach to consistency checking

Currently only in reduce_mean while I get feedback

* Restructure arrayinfo creation

* Formatting

* Replace decorator with ArrayInfo everywhere

* Update thinc/util.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename ainfo var

* Relax typing

* Change dtype to not pick up name

* Remove leftover reference

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
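To make the thinc change above concrete, here is an illustrative sketch of the idea (the names mirror the commit messages, but this is not thinc's actual API): record the shape and dtype of the forward output, and assert that the gradient handed to backprop matches, so a size mismatch raises instead of silently producing NaNs.

```python
from dataclasses import dataclass
import numpy

@dataclass
class ArrayInfo:
    shape: tuple
    dtype: numpy.dtype

    @classmethod
    def from_array(cls, arr):
        return cls(shape=arr.shape, dtype=arr.dtype)

    def check_consistency(self, arr):
        assert arr.shape == self.shape, f"expected {self.shape}, got {arr.shape}"
        assert arr.dtype == self.dtype, f"expected {self.dtype}, got {arr.dtype}"

def reduce_mean(X: numpy.ndarray, lengths: numpy.ndarray):
    # X has one row per token; produce one mean row per sequence (lengths > 0).
    starts = numpy.cumsum(lengths)[:-1]
    Y = numpy.stack([seq.mean(axis=0) for seq in numpy.split(X, starts)])
    ainfo = ArrayInfo.from_array(Y)

    def backprop(dY: numpy.ndarray) -> numpy.ndarray:
        # Raise on an inconsistent gradient instead of reading bogus memory.
        ainfo.check_consistency(dY)
        return numpy.repeat(dY / lengths[:, None], lengths, axis=0)

    return Y, backprop
```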
@svlandeg (Member) left a comment

This turned out to be a little more involved than we'd expected at the beginning, but with the new release of spacy-legacy, this is now finally good to merge 🎉

@svlandeg (Member) commented Mar 4, 2022

Nice work @polm !

@svlandeg merged commit 91acc3e into explosion:master on Mar 4, 2022
@polm mentioned this pull request on Mar 7, 2022
mrriteshranjan added a commit to mrriteshranjan/spaCy that referenced this pull request Mar 11, 2022
@polm mentioned this pull request on Dec 5, 2022