
Fix tokenization mismatch handling in coref #11042
Merged: 19 commits into explosion:feature/coref on Jul 11, 2022

Conversation

polm (Contributor) commented Jun 28, 2022

Description

Coref was written without any consideration for mismatches in tokenization between the gold and predicted docs. This PR handles such mismatches correctly.

One open question is what to do when a target token (the head of a mention span) isn't shared between the two tokenizations. In some cases a mitigation strategy may be desirable, like the expand/contract options in doc.char_span; at present, this PR simply raises an error if a target can't be mapped.
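To make the problem concrete, here is a minimal sketch (illustrative only, not code from this PR) of a gold/predicted tokenization mismatch:

import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
text = "I like New York."
predicted = nlp.make_doc(text)  # default tokenizer: "New" and "York" are separate tokens
# hypothetical gold tokenization in which "New York" is a single token
reference = Doc(nlp.vocab, words=["I", "like", "New York", "."], spaces=[True, True, False, False])
eg = Example(predicted, reference)
# A gold mention headed by the "New York" token has no single corresponding
# token in the predicted doc, so its index can't be copied over directly.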

This is still in progress.

Types of change

Bug fix.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

polm added 3 commits June 28, 2022 19:05
  • This runs, but the results are nonsense because the indices are off.
  • This changes the tok2vec size in coref to a hardcoded 64 to get tests to run. This should be reverted and hopefully replaced with proper shape inference.
  • This may not be done yet, as the test is just for consistency and is not overfitting correctly yet.
polm added the feat / coref (Feature: Coreference resolution) label Jun 28, 2022
polm requested a review from kadarakos June 29, 2022 10:03
polm added 10 commits July 1, 2022 19:09
  • This test only fails due to the explicit assert False at the moment, but the debug output shows that the learned spans are all off by one due to misalignment, so the code still needs fixing.
  • I believe this resolves issues with tokenization mismatches.
polm marked this pull request as ready for review July 3, 2022 11:17
polm added the bug (Bugs and behaviour differing from documentation) label Jul 3, 2022
polm (Contributor, Author) commented Jul 3, 2022

@explosion-bot Please test_slow_gpu

polm (Contributor, Author) commented Jul 3, 2022

I believe this should be working well enough to merge now.

There's still a question of what to do when annotations don't have compatible boundaries between the gold and predicted Docs. Currently coref gives up, and the span predictor ignores those spans (zero gradients); I don't have strong feelings about which of these is better. As mentioned in the initial post, I could also see mitigation strategies making sense (I feel like expand would usually be appropriate), though they would make the component more complicated.
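(For illustration, a minimal sketch of what an expand-style mitigation could look like, using the existing alignment_mode option of doc.char_span; this is not part of this PR:)

import spacy

nlp = spacy.blank("en")
doc = nlp("coreference resolution")
# Character offsets 0..5 ("coref") end inside a token, so strict alignment fails.
assert doc.char_span(0, 5, alignment_mode="strict") is None
# "expand" snaps the offsets outward to the surrounding token boundaries.
span = doc.char_span(0, 5, alignment_mode="expand")
assert span.text == "coreference"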

adrianeboyd (Contributor) commented

There is one remaining catch with using character offsets with Example: the whitespace isn't required to align between the predicted and reference docs, so working with character offsets could lead to unexpected or incorrect results.

eg = Example(nlp.make_doc("\n\nThis   is  an    example."), nlp("This is an example."))

I think this is rare in practice and it's okay if the component can't handle it, but there should be explicit errors in all the relevant spots to make clear that this component can't handle this kind of example.

polm (Contributor, Author) commented Jul 4, 2022

Oh wow, I was not aware that was allowed. I'll work on adding checks that the character spans have the same contents after being translated from the reference to the predicted doc.
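(A hypothetical version of such a check, just to illustrate the idea; the function name and placement are assumptions, not this PR's code:)

from spacy.training import Example

def char_span_contents_match(eg: Example, start_char: int, end_char: int) -> bool:
    # Compare the text at the same character offsets in both docs; if the
    # whitespace differs between them, these slices won't match.
    return eg.reference.text[start_char:end_char] == eg.predicted.text[start_char:end_char]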

adrianeboyd (Contributor) commented

I think you could just check that the texts of the predicted and reference docs are the same? (You could also lowercase them if you wanted to allow a little more of the original leeway.)

The actual requirement is:

# Check that the two texts only differ in whitespace and capitalization
if re.sub(r"\s+", "", str_a) != re.sub(r"\s+", "", str_b) or \
        len_str_a != len(char_to_token_a) or \
        len_str_b != len(char_to_token_b):
    raise ValueError(Errors.E949.format(x=str(A[:10]), y=str(B[:10])))

(We ran into problems at one point because Turkish dotless i is not the same number of characters uppercase and lowercase.)
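(A minimal sketch of the simpler up-front check, assuming it compares whole texts with whitespace stripped; illustrative only, not necessarily the check that ended up in the PR:)

import re
from spacy.training import Example

def texts_compatible(eg: Example) -> bool:
    # The predicted and reference texts must match once whitespace is removed;
    # anything beyond whitespace differences can't be handled by the component.
    return re.sub(r"\s+", "", eg.predicted.text) == re.sub(r"\s+", "", eg.reference.text)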

polm (Contributor, Author) commented Jul 4, 2022

Thanks, that makes it clear, and checking up front would definitely be easier. I was thinking that if the checks were only in the required places it would make them easier to remove / fix later, but it's probably not that big a difference.

polm added 2 commits July 4, 2022 19:28
  • Docs in Examples are allowed to have arbitrarily different whitespace. Handling that properly would be nice but isn't required; for now, check for it and blow up.
polm (Contributor, Author) commented Jul 4, 2022

OK, the point about differing whitespace should be addressed now.

polm requested a review from adrianeboyd July 5, 2022 05:20
adrianeboyd (Contributor) commented

@explosion-bot please test_gpu

explosion-bot (Collaborator) commented

⏳ Triggering 'GPU Tests' for 'spaCy' on Buildkite...

explosion-bot (Collaborator) commented

🚨 Errors

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/testing.py", line 329, in invoke
    cli.main(args=args or (), prog_name=prog_name, **extra)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/explosionbot/github_commands/test.py", line 174, in test_gpu
    test_command(repo_name, test_suite, thinc_branch, spacy_branch, run_on)
  File "/opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/explosionbot/github_commands/test.py", line 139, in test_command
    branch = get_pr_branch()
NameError: name 'get_pr_branch' is not defined

adrianeboyd (Contributor) commented

@explosion-bot please test_gpu

explosion-bot (Collaborator) commented Jul 5, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/97

# Convert the start/end scores into gradients via softmax cross-entropy.
n_classes = start_scores.shape[1]
start_probs = ops.softmax(start_scores, axis=1)
end_probs = ops.softmax(end_scores, axis=1)
# One-hot encode the gold start/end indices as targets.
start_targets = to_categorical(starts, n_classes)
end_targets = to_categorical(ends, n_classes)
# The gradient of softmax + cross-entropy is (probabilities - one-hot targets).
start_grads = start_probs - start_targets
end_grads = end_probs - end_targets
grads = ops.xp.stack((start_grads, end_grads), axis=2)
# now return to original shape, with 0s
final_start_grads = ops.alloc2f(*span_scores[:, :, 0].shape)
Review comment from a Contributor:

For me it might be clearer to use the more familiar approach of a mask, i.e. grads * mask where the mask is 1 for entries to keep and 0 otherwise, but I think this does the same thing.

polm (Contributor, Author) replied:


If the array were full size, with fields that were to be masked out later, wouldn't that change the softmax calculation? Besides that, I think these are equivalent.
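(A tiny self-contained numpy illustration of the masking idea described above, with assumed shapes; not the PR's implementation:)

import numpy as np

grads = np.array([[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]])
mask = np.array([[1.0], [0.0], [1.0]])  # 1 = keep the row, 0 = ignore it
masked_grads = grads * mask
# Rows with mask 0 become all zeros, i.e. they contribute no gradient.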


for i in range(5):
# Needs ~12 epochs to converge
Review comment from a Contributor:


Maybe this test can be sped up by setting the learn_rate of Adam to 1.0?
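(For reference, a one-line sketch of that tweak with thinc's Adam; the value 1.0 is just the suggestion above:)

from thinc.api import Adam

# A much larger learning rate than the default (0.001), so the test converges faster.
optimizer = Adam(learn_rate=1.0)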

polm (Contributor, Author) replied:


That might work, but running the whole file takes 5s on my machine, so I'm not sure it's all that slow anyway?

polm and others added 3 commits July 6, 2022 13:46
  • Co-authored-by: kadarakos <kadar.akos@gmail.com>
  • Basically the same as get_clusters_from_doc
polm mentioned this pull request Jul 6, 2022
polm requested a review from kadarakos July 6, 2022 10:44
polm merged commit 9cbb970 into explosion:feature/coref Jul 11, 2022
polm (Contributor, Author) commented Jul 11, 2022

Merged this to get all the coref changes in one place rather than having several small PRs.
