Support lazy, recursive sentence splitting #7

Closed · wants to merge 19 commits

Conversation

@danieldk (Contributor) commented Mar 2, 2022

We use sentence splitting in the biaffine parser to keep the O(n^2) biaffine attention model tractable. However, since the sentence splitter makes errors, the parser may not have the correct head available.

This change adds another splitting strategy and makes it the preferred one. The goal of this strategy is to split a Doc into pieces that are as large as possible given a maximum length max_length. This reduces the number of attachment errors caused by incorrect sentence splits, while still providing an upper bound on complexity (O(max_length^2)).

The algorithm works as follows:

  • If the length |d| > max_length:
  • Find the highest-probability split point in d according to senter.
  • Split d into d_1 and d_2 at that point.
    • Recursively apply this algorithm to d_1 and d_2.
  • Otherwise: do nothing
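
For concreteness, here is a minimal sketch of the strategy in Python. The function name and the `probs` array (per-token sentence-start probabilities from senter) are illustrative assumptions, not the PR's actual implementation:

```python
def lazy_split(tokens, probs, max_length):
    """Recursively partition `tokens` into spans of at most `max_length`
    tokens, always splitting at the most probable sentence start."""
    if len(tokens) <= max_length:
        return [tokens]
    # Never split before the first token: that would leave the span
    # unchanged and recurse forever (see the back-off notes below).
    split = max(range(1, len(tokens)), key=lambda i: probs[i])
    return (lazy_split(tokens[:split], probs[:split], max_length)
            + lazy_split(tokens[split:], probs[split:], max_length))
```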

Note: this is a draft; it requires functionality from PR explosion/spaCy#11002, which targets spaCy v4.

We use a back-off when the first token is the best splitting point, to avoid infinite recursion. The back-off was previously to simply use the second token; refine this to choose the second-most probable splitting point instead.
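
A hedged sketch of the refined back-off; the function name and NumPy usage are illustrative, not the PR's code:

```python
import numpy as np

def pick_split(probs: np.ndarray) -> int:
    """Return the best split point; fall back to the second-most probable
    point when the best one is the span start. Assumes len(probs) >= 2."""
    order = np.argsort(probs)  # indices, ascending by probability
    best, second = int(order[-1]), int(order[-2])
    return second if best == 0 else best
```
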
We now use the spaCy parser scorer.

Now that Thinc doesn't set the Tensor type globally anymore, we have to make sure that tensors are placed on the correct device.
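
Illustratively, in PyTorch terms this means creating tensors on the model's device explicitly rather than relying on a global default; the device selection below is an assumption for the sketch:

```python
import torch

# Pick the device the model runs on instead of assuming a global default.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Any tensor built from plain Python data must be placed there explicitly.
lengths = torch.as_tensor([3, 7, 5], device=device)
```
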
Before this change, we'd use the senter pipe directly. However, this did not work with the transformer model without modifications (because it clears tensors after backprop). By using the functionality proposed in explosion/spaCy#11002, we can use the activations that are stored by the senter pipe in `Doc`.
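
A sketch of how this would be consumed, assuming the `save_activations` support from explosion/spaCy#11002; the model name and the `"probabilities"` key are assumptions, not confirmed here:

```python
# Assumes a spaCy v4 pipeline (per explosion/spaCy#11002) with an
# enabled senter component.
import spacy

nlp = spacy.load("en_core_web_sm")
senter = nlp.get_pipe("senter")
senter.save_activations = True  # attribute per the linked PR
doc = nlp("One sentence. Another sentence.")
# Per-token sentence-start scores stored on the Doc by the senter pipe.
sent_scores = doc.activations[senter.name]["probabilities"]  # key name assumed
```
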
Also make evaluation targets depend on the corpus they use.
@danieldk (Contributor, Author) commented Feb 24, 2023

Please don't review this PR, I am doing some force pushes here, mostly to get this building against spaCy v4. I'll close this PR and open up a new one when it's ready for review.

for doc in docs:
    activations = doc.activations.get(senter.name, None)
    if activations is None:
        raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n"
Suggested change (from a reviewer):
- raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n"
+ raise ValueError("Lazy splitting requires `senter` with `save_activations` enabled.\n"

@danieldk (Author) replied:

I'll fix that, but we really shouldn't use this as a reviewing PR, it already had force pushes and stuff, so I'll close it.

@danieldk closed this Feb 27, 2023