Fix entity linker batching #9669
Conversation
Thanks for looking into this Paul!
About the listener architecture: I think your proposed changes make sense (I'll look into it in a bit more detail when this is out of draft), though in general we should think about what should happen in these cases. As you point out, there is an implicit assumption here on how listeners should be used. We should either document that, or look into a kind of fall-back mechanism for when the listener's cache fails. We can probably defer that to a future discussion/PR though.
About the entities: conceptually, I think we should allow either training the EL on gold entities (as before), or on predicted ones. The latter is perfectly possible with the recent "annotating components" feature. The EL should then take in a parameter that defines whether gold entities should be used (True by default) or not. In the latter case, the user needs to put the NER, or any other component that writes to doc.ents, in annotating_components. We can't really check this, though we can at least output a warning if annotating_components is empty.
I hadn't considered the possibility of adding a parameter - that sounds great. I'll work on doing that, and I think with that change this can come out of draft. For the broader issue, I think a doc note can be added on the current behavior pretty easily, and we can discuss later whether the current requirements are desirable or not.
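To make the intended usage concrete, here is a minimal sketch of how the flag discussed above could be combined with annotating components; use_gold_ents is the parameter proposed in this thread, and the values here are illustrative rather than a finalized API.

```python
import spacy

# Sketch only: use_gold_ents is the flag discussed above, not a settled default.
nlp = spacy.blank("en")
nlp.add_pipe("ner")
nlp.add_pipe("entity_linker", config={"use_gold_ents": False})

# When training on predicted entities, the NER (or whichever component writes
# to doc.ents) has to run over the examples first, i.e. it must be listed in
# the training config:
#
#   [training]
#   annotating_components = ["ner"]
```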
Failing tests are because for tests that use the default ... In initialization specifically we can check if there are no NER annotations and add some fake ones so the network can run; I'm not sure if that's the right approach though.
So the tests have been failing because the changes in architecture affect the loss calculation too. In the old version of the component, there would be one Doc per gold entity with a KB entry in the input to the loss function. The loss function calculates an embedding for each gold entity and compares those. The issue is that in the revised version, the model doesn't know what's gold and what isn't - it just uses all entities. This means that it gets predictions for NIL entities, which the loss function doesn't calculate a value for. So that has to be resolved somehow. I guess these are the approaches:

1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)
I tried 2 first, but the loss was too high and the test consistently failed. My most recent commit takes approach 1 and it passes - but only sometimes. Sometimes the loss is a little too high, and more rarely the loss goes to NaN. I'm not sure why it's not reproducible or why NaN would come up - seems there's some kind of overflow?
I am not sure about approach 3; I need to think about it more. I guess the model can use the ...
Thinking about approach 3 some more, I don't think it'll work. The model doesn't know if an entity has a KB annotation or not (because that would imply cheating). It also doesn't currently have access to the get_candidates function, so it can't check that (the function is only in the component). What we could do is have the component "erase" entities that don't have KB annotations, or that would get NIL annotations... but I'm not sure how that would interact with other pipeline components. I guess we could restore the entity annotations afterwards, but that gets messy. I think there's nothing wrong in principle with the approach I'm taking now, though I'm not sure what's up with the math - even if it's wrong, I'd expect it to be consistent on CPU.
Thanks for the work on this Paul. We need to dig into the nan issue - I left some review comments. I think the key is in the missing backprop.
In general I'd like to see more unit tests here: test edge cases with no entities (predicted or gold), test different scenarios of the use_gold_ents setting, etc. It would also be good to have a specific test for both the legacy v1 as well as the new v2 architecture, to ensure both get initialized well, can train one or two steps, and can predict. The test can be written by loading an nlp from a minimal config that is filled with defaults depending on either architecture.
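As a rough illustration of the kind of test being asked for here (a sketch, not the final test: the KB setup and the actual train/predict steps are omitted, the tok2vec values are placeholders, and the v1 architecture is assumed to be available, e.g. via a spacy-legacy version that registers it):

```python
import pytest
from thinc.api import Config
from spacy.util import load_model_from_config

TEMPLATE = """
[nlp]
lang = "en"
pipeline = ["entity_linker"]

[components.entity_linker]
factory = "entity_linker"

[components.entity_linker.model]
@architectures = "{arch}"

[components.entity_linker.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 32
depth = 1
embed_size = 300
window_size = 1
maxout_pieces = 2
subword_features = true
"""

@pytest.mark.parametrize("arch", ["spacy.EntityLinker.v1", "spacy.EntityLinker.v2"])
def test_entity_linker_architectures(arch):
    # Fill the remaining settings with defaults for the given architecture.
    config = Config().from_str(TEMPLATE.format(arch=arch))
    nlp = load_model_from_config(config, auto_fill=True, validate=True)
    assert nlp.pipe_names == ["entity_linker"]
```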
Some additional comments to your questions:
What happens when no entities are available? This can happen with unlucky batching if enough training docs don't have entities. With no entities some of the calculations just don't work out. Maybe we need to provide some filler data?
Is there a way to spot this very early on, like we do for empty docs, and prevent the neural network from predicting at all?
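A tiny sketch of that early-exit idea (illustrative only, not the actual implementation):

```python
# Bail out before the forward pass if the whole batch has no entities,
# analogous to skipping empty docs.
def batch_has_entities(docs):
    return any(len(doc.ents) > 0 for doc in docs)

# e.g. inside the component's update()/predict():
#   if not batch_has_entities(docs):
#       return  # nothing to predict, so no loss or gradient either
```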
I guess these are the approaches:
1. In the loss calculation, ignore predictions for known NIL entities (my most recent commit tries this)
2. In the loss calculation, add empty/placeholder embeddings for known NIL entities (I tried this but the loss was too high for the test)
3. In training, remove entity annotations for gold entities with no KB ID (not tried yet)
The problem is that the EL model doesn't try to learn the ID for a given entity; it tries to get the embedding of the sentence as close as it can to the entity embedding. There can be no universal NIL embedding - that will mess up normal embeddings too much. So 2. won't work.
I think 1. should work though: similar to how we would mask missing information when calculating textcat loss, we should mask the NIL entity and not derive loss values from it.
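A minimal numpy sketch of that masking idea (this is not the actual spaCy loss code; the function name and the squared-error distance are placeholders for illustration):

```python
import numpy as np

def masked_l2_loss(sentence_encodings, entity_encodings, has_kb_entry):
    # sentence_encodings: model output, one row per predicted entity
    # entity_encodings: gold KB vectors; rows for NIL entities are ignored
    # has_kb_entry: boolean mask marking entities that exist in the KB
    diff = sentence_encodings - entity_encodings
    diff[~has_kb_entry] = 0.0                  # NIL entities contribute nothing
    n_kept = max(int(has_kb_entry.sum()), 1)
    loss = float((diff ** 2).sum()) / n_kept
    d_sentence = 2.0 * diff / n_kept           # gradient w.r.t. the model output
    return loss, d_sentence

# Example: the third entity is NIL, so only the first two drive the update.
sent = np.asarray([[0.1, 0.2], [0.3, 0.1], [0.9, 0.9]], dtype="float32")
gold = np.asarray([[0.0, 0.2], [0.4, 0.1], [0.0, 0.0]], dtype="float32")
mask = np.asarray([True, True, False])
loss, grad = masked_l2_loss(sent, gold, mask)
```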
I think this should go green this time. The issue was with the gradient for entities not in the KB. A gradient can't actually be calculated for them, and the model was only returning gradients it could calculate. But because the model doesn't have information about which entities are in the KB, that meant there could be a mismatch between the number of calculated embeddings and the size of the gradient. A mismatch like that should probably cause an error, but it seems it may have been accessing out-of-bounds memory or doing something else weird. So that's where the nans came from, and it's another problem that needs a follow-up.
Behavior of this with numpyops seems wrong, as instead of giving an error it produces nans, as though it's accessing out of bounds memory or something. See explosion/spaCy#9669.
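The kind of consistency check referred to above, sketched in plain numpy rather than the actual Thinc ops code: record the shape seen in the forward pass and assert that the gradient handed to backprop matches it, so a mismatch fails loudly instead of producing nans.

```python
import numpy as np

def reduce_mean(X, lengths):
    # X: concatenated rows for all sequences; lengths: number of rows per sequence
    n_rows, width = X.shape
    means = np.zeros((len(lengths), width), dtype=X.dtype)
    start = 0
    for i, length in enumerate(lengths):
        means[i] = X[start : start + length].mean(axis=0)
        start += length

    def backprop(d_means):
        # The added check: one gradient row per sequence, same width as the input.
        assert d_means.shape == (len(lengths), width), "gradient shape mismatch"
        dX = np.zeros((n_rows, width), dtype=X.dtype)
        s = 0
        for i, length in enumerate(lengths):
            dX[s : s + length] = d_means[i] / length
            s += length
        return dX

    return means, backprop
```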
Test failure is about garbage collection in textcat...? Might need to update from master; will look more at this later.
That failing spancat test is really weird. I see that there are some commits added to this branch that aren't supposed to be there. Maybe rebase to the current master?
Test still fails here, but it passes locally for me. Not sure what's up... (Also, you're right, it's spancat - I mistakenly said textcat before.) EDIT: Ah, never mind, it fails locally too. Hm...
Looking at the failing test in detail, it looks like it is flaky in master - I ran it 20 times on master and it failed 4 times. Not really sure what's going on here...
That's really weird. Could you create a separate PR, disable the flaky test, and create a follow-up ticket on our internal board to look into this in the near future? Then we can continue with this PR after the test has been disabled on master.
This looks bizarre. My bet is the new version of numpy (1.22.0), but we should figure out why. Edited: No, not numpy (which would have been weird, to be honest). It's not really a good example. I thought there was a brief train step in there, but it looks like this hasn't changed at all recently. Something in ...
Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError?
I added a failing test for the v1 legacy architecture. I tested this locally with the corresponding legacy branch, but it fails because EntityLinker_v1 is never returned, because model.name isn't matched.
Similarly, when running the NEL emerson project, the v2 architecture is returned. It looks like the new v2 code just runs in that case, probably because the data is extremely simple. But it seems to run more by coincidence because the legacy fallback isn't actually used.
spacy/pipeline/entity_linker.py (Outdated):

    # Handle legacy model
    if model.name == "spacy.EntityLinker.v1":
How did you verify that this works? I don't really see how it could - the model.name is never set to that specific string as far as I can see.
I think I completely overlooked this - it's clear looking at it now that it won't work at all, since model.name is the automatically constructed Thinc name.
However, looking over this, I'm not sure the architecture name is available at all at this point. It's not in the model, it's not in the arguments here, and nlp.config doesn't seem to contain the config for the current component (and peeking at that would be nasty anyway).
I need to think about this more, but can we even check the architecture here?
My solution for the issue here is to look for the name of the new custom component that pulls out the spans within the Thinc model name. If it's not there, I assume it's the old model.
That feels awfully messy, but I think it's the cleanest way to check things here, since the architecture name isn't actually available.
It would work for now, but the solution does feel a bit brittle & implicit, making it prone to bugs in the future.
I'm almost tempted to define a new attr that would record the factory, maybe something like:

    model.attrs["spacy_factory"] = {"EntityLinker": 2}

Then any internal check could verify the version, and if the right attribute is not available, assume it's below whatever you're checking, because it's a legacy one. In time I think this might be the most robust/clear solution, but it would require maintenance and making sure things stay consistent whenever we introduce a new model version...
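A hypothetical sketch of that attrs-based check; the "spacy_factory" key and the helper names are made up for illustration and are not an existing spaCy or Thinc convention:

```python
from thinc.api import Model, chain, Linear

def build_entity_linker_model(width: int) -> Model:
    model = chain(Linear(width), Linear(width))
    # The new-style model records which factory version produced it.
    model.attrs["spacy_factory"] = {"EntityLinker": 2}
    return model

def is_legacy_entity_linker(model: Model) -> bool:
    # Anything without the attribute is assumed to predate version 2.
    try:
        return model.attrs["spacy_factory"]["EntityLinker"] < 2
    except KeyError:
        return True
```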
Still working on this - in order to get ...
So this works, and I have tests for it, but the code is kind of a mess. The issue is that I need aligned entities, but I'm not sure that any of the existing alignment functions do what I need. The model produces an Entity Linker prediction for each entity on the predicted document. Since those predictions are just a list of predicted embeddings, in order to filter them it's not enough to get the aligned entities - the index relative to the predicted embeddings is also necessary. I think I can make this cleaner by getting the aligned entities before the forward pass, setting them on the predicted doc temporarily, and then restoring the predicted entities, but I need to check it. If that works, it'll have the nice side effect of not doing calculations in the forward pass for instances that can't be backpropagated.
This doesn't actually work because the "aligned" ents are gold-only. But if I have a different function that returns the intersection, *then* this will work as desired.
This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used.
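An illustrative sketch of what such an intersection helper could look like (this is not the helper added in the PR; the name is made up), keeping the index into the predicted entities so the per-entity predictions can be filtered to the ones a gradient can actually be computed for:

```python
def intersect_ents(pred_ents, gold_ents):
    # Match entities by character offsets; return (index, span) pairs so the
    # corresponding rows of the model's predictions can be selected.
    gold_offsets = {(ent.start_char, ent.end_char) for ent in gold_ents}
    return [
        (i, ent)
        for i, ent in enumerate(pred_ents)
        if (ent.start_char, ent.end_char) in gold_offsets
    ]
```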
OK, I have added a ...
Tests are now failing because ... I'm not really sure how to handle that... We could do a dev release of ...
Why should we have to rip it out again? IIRC there's no issue when a function is declared in both ... Actually, you can go ahead and test locally whether this gives any issues when you have explosion/spacy-legacy#18 installed next to spaCy's ...
I have been testing it locally - with the patched version of ... I thought that maybe using a special release of ... If it's not too complicated, we can do the test release and update the dependencies in this branch.
Sorry, I should have been clearer. Ideally, we can just go ahead with a proper release of ...
* Add assert that size hasn't changed for reduce mean backprop

  Behavior of this with numpyops seems wrong, as instead of giving an error it produces nans, as though it's accessing out of bounds memory or something. See explosion/spaCy#9669.

* Add clean checks for all reduce functions

  This also saves the size in a variable so the actual data can be garbage collected.

* Add minimal test

  There are no tests for most of the reduce layers...

* Add size consistency check to softmax

  I think the math won't work out anyway if this is inconsistent, but checking here makes the cause of errors more explicit.

* Use a decorator for consistent backprop

  This is cleaner than copying the other code. Maybe this decorator could go on the top of a layer too to be even cleaner?

* Make backprop decorator work on forward

  This simplifies things a bit at the point of usage.

* Add tests for all reduce methods

* Update thinc/util.py

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* New approach to consistency checking

  Currently only in reduce_mean while I get feedback

* Restructure arrayinfo creation

* Formatting

* Replace decorator with ArrayInfo everywhere

* Update thinc/util.py

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename ainfo var

* Relax typing

* Change dtype to not pick up name

* Remove leftover reference

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
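To give a rough picture of the ArrayInfo approach mentioned in the commit messages above (the class name and fields here are inferred from those messages, not necessarily the final Thinc API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ArrayInfo:
    shape: tuple
    dtype: np.dtype

    @classmethod
    def from_array(cls, arr):
        # Capture only metadata, so the array itself can be garbage collected.
        return cls(shape=arr.shape, dtype=arr.dtype)

    def check_consistency(self, arr):
        assert arr.shape == self.shape, f"shape changed: {arr.shape} != {self.shape}"
        assert arr.dtype == self.dtype, f"dtype changed: {arr.dtype} != {self.dtype}"

# Usage: in the forward pass, capture metadata of the array the gradient must
# match, then verify it inside the backprop callback:
#   ainfo = ArrayInfo.from_array(Y)
#   ... later, in backprop: ainfo.check_consistency(dY)
```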
This turned out to be a little more involved than we'd expected at the beginning, but with the new release of spacy-legacy, this is now finally good to merge 🎉
Nice work @polm !
This reverts commit 91acc3e.
Description
The Entity Linker doesn't work with Listeners because it violates an (implicit?) requirement that the batch passed to the pipeline update must be the same as the batch passed to the model's forward using begin_update. The reason the Entity Linker does this is that it builds sentence docs to get context around relevant entities. The reason this wasn't obviously a problem before is that the EntityLinker predates the use of Listeners in spaCy.
This PR changes the structure of the EntityLinker so that the batch is handled consistently with other components and works with Listeners. This moves the logic of building context inside the model.
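As a rough sketch of the contract involved (names are illustrative; this is not the actual spaCy code): the Docs passed to the component's update are exactly the Docs the model must be called on, because the tok2vec listener caches activations keyed on those Doc objects.

```python
def update_with_listener(pipe, examples, optimizer, losses):
    docs = [eg.predicted for eg in examples]
    # Running the model on anything other than `docs` (for example, freshly
    # built sentence docs, as the old EntityLinker did) misses the listener's
    # cached tok2vec activations for this batch.
    scores, backprop = pipe.model.begin_update(docs)
    loss, d_scores = pipe.get_loss(examples, scores)
    backprop(d_scores)
    pipe.model.finish_update(optimizer)
    losses[pipe.name] = losses.get(pipe.name, 0.0) + loss
    return losses
```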
There are a few issues with this that are still unresolved. The main one is that the EntityLinker only makes a prediction if entities are available. This has two issues: ...
With the first commit here, training an NER and NEL model together using listeners is possible using this user-provided test repo. But in order for this to work, entities are unconditionally copied over from the reference Docs when preparing the batch, which may not be desirable when training with NER, and is not really what we want for a final fix.
Types of change
Bug fix. Brought up in these issues: #9310, #9291.
#9575 seems to be a separate problem but it might make sense to fix it in this PR too.
Checklist