Support lazy, recursive sentence splitting #7

Closed · wants to merge 19 commits

Conversation

@danieldk (Contributor) commented Mar 2, 2022

We use sentence splitting in the biaffine parser to keep the O(n^2) biaffine attention model tractable. However, since the sentence splitter makes errors, the parser may not have the correct head available.

This change adds another splitting strategy and makes it the preferred one. The goal of this strategy is to split a Doc into pieces that are as large as possible given a maximum length max_length. This reduces the number of attachment errors caused by incorrect sentence splits, while still providing an upper bound on complexity (O(max_length^2)).

The algorithm works as follows:

  • If the length |d| > max_length:
  • Find the highest-probability split point in d according to senter.
  • Split d into d_1 and d_2 at that point.
    • Recursively apply this algorithm to d_1 and d_2.
  • Otherwise: do nothing
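
For concreteness, here is a minimal sketch of the strategy in Python. The function name and the `probs` array (per-token sentence-start probabilities from senter) are illustrative assumptions, not the PR's actual implementation:

```python
def lazy_split(tokens, probs, max_length):
    """Recursively partition `tokens` into spans of at most `max_length`
    tokens, always splitting at the most probable sentence start."""
    if len(tokens) <= max_length:
        return [tokens]
    # Never split before the first token: that would leave the span
    # unchanged and recurse forever (see the back-off notes below).
    split = max(range(1, len(tokens)), key=lambda i: probs[i])
    return (lazy_split(tokens[:split], probs[:split], max_length)
            + lazy_split(tokens[split:], probs[split:], max_length))
```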

Note: this is a draft; it requires functionality from PR explosion/spaCy#11002, which targets spaCy v4.

We use a back-off when the first token is the best splitting point, to avoid infinite recursion. The back-off was previously to simply use the second token; refine this to choose the second-most probable splitting point instead.
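
A hedged sketch of the refined back-off; the function name and NumPy usage are illustrative, not the PR's code:

```python
import numpy as np

def pick_split(probs: np.ndarray) -> int:
    """Return the best split point; fall back to the second-most probable
    point when the best one is the span start. Assumes len(probs) >= 2."""
    order = np.argsort(probs)  # indices, ascending by probability
    best, second = int(order[-1]), int(order[-2])
    return second if best == 0 else best
```
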
We now use the spaCy parser scorer.

Now that Thinc doesn't set the Tensor type globally anymore, we have to make sure that tensors are placed on the correct device.
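
Illustratively, in PyTorch terms this means creating tensors on the model's device explicitly rather than relying on a global default; the device selection below is an assumption for the sketch:

```python
import torch

# Pick the device the model runs on instead of assuming a global default.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Any tensor built from plain Python data must be placed there explicitly.
lengths = torch.as_tensor([3, 7, 5], device=device)
```
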
Before this change, we'd use the senter pipe directly. However, this did not work with the transformer model without modifications (because it clears tensors after backprop). By using the functionality proposed in explosion/spaCy#11002, we can use the activations that are stored by the senter pipe in `Doc`.
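
A sketch of how this would be consumed, assuming the `save_activations` support from explosion/spaCy#11002; the model name and the `"probabilities"` key are assumptions, not confirmed here:

```python
# Assumes a spaCy v4 pipeline (per explosion/spaCy#11002) with an
# enabled senter component.
import spacy

nlp = spacy.load("en_core_web_sm")
senter = nlp.get_pipe("senter")
senter.save_activations = True  # attribute per the linked PR
doc = nlp("One sentence. Another sentence.")
# Per-token sentence-start scores stored on the Doc by the senter pipe.
sent_scores = doc.activations[senter.name]["probabilities"]  # key name assumed
```
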
Also make evaluation targets depend on the corpus they use.
@danieldk (Contributor, Author) commented Feb 24, 2023

Please don't review this PR, I am doing some force pushes here, mostly to get this building against spaCy v4. I'll close this PR and open up a new one when it's ready for review.

for doc in docs:
    activations = doc.activations.get(senter.name, None)
    if activations is None:
        raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n"
Suggested change (from a reviewer):
- raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n"
+ raise ValueError("Lazy splitting requires `senter` with `save_activations` enabled.\n"

@danieldk (Author) replied:

I'll fix that, but we really shouldn't use this as a reviewing PR, it already had force pushes and stuff, so I'll close it.

@danieldk closed this Feb 27, 2023