Fix tokenization mismatch handling in coref #11042
Conversation
This runs, but the results are nonsense because the indices are off.
This changes the tok2vec size in coref to a hardcoded 64 to get tests to run. This should be reverted and hopefully replaced with proper shape inference.
This may not be done yet: the test only checks consistency and doesn't yet overfit correctly.
Had to renumber the error message.
This test only fails due to the explicit `assert False` at the moment, but the debug output shows that the learned spans are all off by one due to misalignment, so the code still needs fixing.
I believe this resolves issues with tokenization mismatches.
@explosion-bot Please test_slow_gpu
I believe this should be working well enough to merge now. There's still a question of what to do when annotations don't have compatible boundaries between the gold and predicted Docs. Currently coref gives up, and the span predictor will ignore those spans (zero gradients); I don't have strong feelings about which of these is better. As mentioned in the initial post here, I could also see offering mitigation strategies making sense (I feel like ...).
There is one remaining catch with using character offsets: an Example where the two docs differ in whitespace, e.g. `eg = Example(nlp.make_doc("\n\nThis is an example."), nlp("This is an example."))`. I think this is rare in practice and it's okay if the component can't handle it, but there should be explicit errors in all the relevant spots saying that this component can't handle this kind of example.
Oh wow, I was not aware that was allowed. I'll work on adding checks that the character spans have the same contents after being translated from the reference to the predicted doc.
I think you could just check that the texts of the predicted and reference docs are the same? (You could also lowercase them if you wanted to allow a little more of the original leeway.) The actual requirement is in spacy/training/align.pyx, lines 16 to 20 (at commit 7c1bf2f).
(We ran into problems at one point because Turkish dotless i is not the same number of characters uppercase and lowercase.)
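For illustration (my own example, not from the thread): the closely related dotted capital İ shows the same length pitfall with Python's default, locale-independent lowercasing.

```python
# Lowercasing Turkish İ with non-Turkish casing rules produces "i" plus
# a combining dot above, so the lowercased string is longer and character
# offsets computed before lowercasing no longer line up.
s = "İstanbul"
print(len(s), len(s.lower()))  # 8 9
```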
Thanks, that makes it clear, and checking up front would definitely be easier. I was thinking that if the checks were only in the required places it would make them easier to remove / fix later, but it's probably not that big a difference.
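Here is a minimal sketch of the kind of up-front check being discussed, assuming spaCy's `Example` API; the function name and error message are invented for illustration.

```python
from spacy.training import Example

def check_matching_texts(example: Example) -> None:
    # Annotations are mapped between the two docs via character offsets,
    # so the underlying texts must be identical, including whitespace.
    if example.predicted.text != example.reference.text:
        raise ValueError(
            "predicted and reference Docs must have identical texts"
        )
```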
Docs in Examples are allowed to have arbitrarily different whitespace. Handling that properly would be nice but isn't required; for now, check for it and blow up.
OK, the point about differing whitespace should be addressed now.
@explosion-bot please test_gpu

🚨 Errors

@explosion-bot please test_gpu

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/97
n_classes = start_scores.shape[1]
start_probs = ops.softmax(start_scores, axis=1)
end_probs = ops.softmax(end_scores, axis=1)
start_targets = to_categorical(starts, n_classes)
end_targets = to_categorical(ends, n_classes)
start_grads = start_probs - start_targets
end_grads = end_probs - end_targets
grads = ops.xp.stack((start_grads, end_grads), axis=2)
# now return to original shape, with 0s
final_start_grads = ops.alloc2f(*span_scores[:, :, 0].shape)
It might be clearer to me if this used the more familiar approach of a mask, as in `grads * mask`, where the mask is 1 for entries to keep and 0 otherwise, but I think this does the same thing actually.
If the array were full size, with fields to be masked out later, wouldn't that change the softmax calculation? Besides that, I think these are equivalent.
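To make that point concrete, here is a small self-contained demonstration (my own sketch in plain NumPy, not code from this PR): applying softmax to a full-size array and masking afterwards is not the same as slicing first, because the padded positions still absorb probability mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0, 3.0])
valid = 2  # only the first two positions are real candidates

# Slice first, then softmax (the approach in the code above).
print(softmax(scores[:valid]))  # [0.731 0.269]

# Softmax over the full array, then zero out the padding: the padded
# positions soaked up probability mass, so the result differs and no
# longer sums to 1.
mask = np.array([1.0, 1.0, 0.0, 0.0])
print(softmax(scores) * mask)  # [0.242 0.089 0.    0.   ]
```

The usual way to make the masked formulation agree is to set the padded logits to a large negative value before the softmax.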
for i in range(5):
    # Needs ~12 epochs to converge
Maybe this test can be sped up by setting the `learn_rate` of `Adam` to `1.0`?
That might work, but running the whole file takes 5s on my machine, so I'm not sure it's all that slow anyway?
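For reference, a rough sketch of what that suggestion could look like with thinc's optimizer; the tiny textcat pipeline and dataset here are placeholders standing in for the coref setup in the actual test.

```python
import spacy
from spacy.training import Example
from thinc.api import Adam

# A placeholder pipeline and dataset; the real test trains the coref
# components instead.
nlp = spacy.blank("en")
nlp.add_pipe("textcat")
train_examples = [
    Example.from_dict(
        nlp.make_doc("This is great."),
        {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}},
    )
]
nlp.initialize(get_examples=lambda: train_examples)

# A higher learning rate can reduce the number of updates a small
# overfitting test needs, at the cost of less stable convergence.
optimizer = Adam(learn_rate=1.0)
for i in range(5):
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)
```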
Co-authored-by: kadarakos <kadar.akos@gmail.com>
Basically the same as `get_clusters_from_doc`.
Merged this to get all the coref changes in one place rather than having several small PRs.
Description
Coref was written without any consideration for mismatches in tokenization between the gold and predicted docs. This PR handles such mismatches correctly.
One consideration here is what to do when a target token (the head of a mention span) isn't shared between the two tokenizations. It may be desirable in some cases to have a mitigation strategy, like the expand/contract strategy in `doc.char_span`. However, at present this PR just throws an error if a target can't be mapped. This is still in progress.
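For context, the expand/contract behavior referenced above comes from the `alignment_mode` argument of spaCy's `Doc.char_span`; a small example of how it resolves character offsets that don't fall on token boundaries:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is an example.")
# Tokens: This | is | an | example | .
# Offsets 11-14 cover only part of the token "example" (chars 11-18).

print(doc.char_span(11, 14, alignment_mode="strict"))   # None: no exact token match
print(doc.char_span(11, 14, alignment_mode="expand"))   # "example": grow over partial tokens
print(doc.char_span(8, 14, alignment_mode="contract"))  # "an": keep only fully covered tokens
```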
Types of change
Bug fix.
Checklist