Fix tokenization mismatch handling in coref #11042
Conversation
This runs, but the results are nonsense because the indices are off.
This changes the tok2vec size in coref to a hardcoded 64 to get tests to run. This should be reverted and hopefully replaced with proper shape inference.
This may not be done yet: the test only checks consistency and doesn't yet overfit correctly.
Had to renumber the error message.
This test only fails due to the explicit `assert False` at the moment, but the debug output shows that the learned spans are all off by one due to misalignment, so the code still needs fixing.
I believe this resolves issues with tokenization mismatches.
@explosion-bot Please test_slow_gpu
I believe this should be working well enough to merge now. There's still a question of what to do when annotations don't have compatible boundaries between the gold and predicted Docs. Currently coref gives up, and the span predictor will ignore those spans (zero gradients); I don't have strong feelings about which of these is better. As mentioned in the initial post here, I could also see offering mitigation strategies making sense (I feel like ...).
There is one remaining catch with using character offsets: an Example where the two docs differ in whitespace, e.g. `eg = Example(nlp.make_doc("\n\nThis is an example."), nlp("This is an example."))`. I think this is rare in practice and it's okay if the component can't handle it, but there should be explicit errors in all the relevant spots saying that this component can't handle this kind of example.
Oh wow, I was not aware that was allowed. I'll work on adding checks that the character spans have the same contents after being translated from the reference to the predicted doc.
I think you could just check that the texts of the predicted and reference docs are the same? (You could also lowercase them if you wanted to allow a little more of the original leeway.) The actual requirement is in spacy/training/align.pyx, lines 16 to 20 (at commit 7c1bf2f).
(We ran into problems at one point because Turkish dotless i is not the same number of characters uppercase and lowercase.)
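For illustration (my own example, not from the thread): the closely related dotted capital İ shows the same length pitfall with Python's default, locale-independent lowercasing.

```python
# Lowercasing Turkish İ with non-Turkish casing rules produces "i" plus
# a combining dot above, so the lowercased string is longer and character
# offsets computed before lowercasing no longer line up.
s = "İstanbul"
print(len(s), len(s.lower()))  # 8 9
```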
Thanks, that makes it clear, and checking up front would definitely be easier. I was thinking that if the checks were only in the required places it would make them easier to remove / fix later, but it's probably not that big a difference.
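Here is a minimal sketch of the kind of up-front check being discussed, assuming spaCy's `Example` API; the function name and error message are invented for illustration.

```python
from spacy.training import Example

def check_matching_texts(example: Example) -> None:
    # Annotations are mapped between the two docs via character offsets,
    # so the underlying texts must be identical, including whitespace.
    if example.predicted.text != example.reference.text:
        raise ValueError(
            "predicted and reference Docs must have identical texts"
        )
```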
Docs in Examples are allowed to have arbitrarily different whitespace. Handling that properly would be nice but isn't required; for now, check for it and blow up.
OK, the point about differing whitespace should be addressed now.
@explosion-bot please test_gpu

🚨 Errors

@explosion-bot please test_gpu

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/97
n_classes = start_scores.shape[1]
start_probs = ops.softmax(start_scores, axis=1)
end_probs = ops.softmax(end_scores, axis=1)
start_targets = to_categorical(starts, n_classes)
end_targets = to_categorical(ends, n_classes)
start_grads = start_probs - start_targets
end_grads = end_probs - end_targets
grads = ops.xp.stack((start_grads, end_grads), axis=2)
# now return to original shape, with 0s
final_start_grads = ops.alloc2f(*span_scores[:, :, 0].shape)
It might be clearer to me if this used the more familiar approach of a mask, as in `grads * mask`, where the mask is 1 for entries to keep and 0 otherwise, but I think this does the same thing actually.
If the array were full size, with fields to be masked out later, wouldn't that change the softmax calculation? Besides that, I think these are equivalent.
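To make that point concrete, here is a small self-contained demonstration (my own sketch in plain NumPy, not code from this PR): applying softmax to a full-size array and masking afterwards is not the same as slicing first, because the padded positions still absorb probability mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0, 3.0])
valid = 2  # only the first two positions are real candidates

# Slice first, then softmax (the approach in the code above).
print(softmax(scores[:valid]))  # [0.731 0.269]

# Softmax over the full array, then zero out the padding: the padded
# positions soaked up probability mass, so the result differs and no
# longer sums to 1.
mask = np.array([1.0, 1.0, 0.0, 0.0])
print(softmax(scores) * mask)  # [0.242 0.089 0.    0.   ]
```

The usual way to make the masked formulation agree is to set the padded logits to a large negative value before the softmax.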
for i in range(5):
    # Needs ~12 epochs to converge
Maybe this test can be sped up by setting the `learn_rate` of `Adam` to `1.0`?
That might work, but running the whole file takes 5s on my machine, so I'm not sure it's all that slow anyway?
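For reference, a rough sketch of what that suggestion could look like with thinc's optimizer; the tiny textcat pipeline and dataset here are placeholders standing in for the coref setup in the actual test.

```python
import spacy
from spacy.training import Example
from thinc.api import Adam

# A placeholder pipeline and dataset; the real test trains the coref
# components instead.
nlp = spacy.blank("en")
nlp.add_pipe("textcat")
train_examples = [
    Example.from_dict(
        nlp.make_doc("This is great."),
        {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}},
    )
]
nlp.initialize(get_examples=lambda: train_examples)

# A higher learning rate can reduce the number of updates a small
# overfitting test needs, at the cost of less stable convergence.
optimizer = Adam(learn_rate=1.0)
for i in range(5):
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)
```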
Co-authored-by: kadarakos <kadar.akos@gmail.com>
Basically the same as `get_clusters_from_doc`.
Merged this to get all the coref changes in one place rather than having several small PRs.
Description
Coref was written without any consideration for mismatches in tokenization between the gold and predicted docs. This PR handles such mismatches correctly.
One consideration here is what to do when a target token (the head of a mention span) isn't shared between the two tokenizations. It may be desirable in some cases to have a mitigation strategy, like the expand/contract strategy in `doc.char_span`. However, at present this PR just throws an error if a target can't be mapped. This is still in progress.
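For context, the expand/contract behavior referenced above comes from the `alignment_mode` argument of spaCy's `Doc.char_span`; a small example of how it resolves character offsets that don't fall on token boundaries:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is an example.")
# Tokens: This | is | an | example | .
# Offsets 11-14 cover only part of the token "example" (chars 11-18).

print(doc.char_span(11, 14, alignment_mode="strict"))   # None: no exact token match
print(doc.char_span(11, 14, alignment_mode="expand"))   # "example": grow over partial tokens
print(doc.char_span(8, 14, alignment_mode="contract"))  # "an": keep only fully covered tokens
```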
Types of change
Bug fix.
Checklist