Support lazy, recursive sentence splitting #7
Conversation
Force-pushed from e66053a to e918001
We use sentence splitting in the biaffine parser to keep the O(n^2) biaffine attention model tractable. However, since the sentence splitter makes errors, the parser may not have the correct head available. This change adds another splitting strategy. The goal of this strategy is to use the highest-probability splits to partition a doc until each partition is smaller than or equal to a maximum length. This reduces the number of attachment errors as a result of incorrect sentence splits, while providing an upper bound on complexity. The algorithm works as follows (a code sketch follows the list):

* If the length |d| > max_length:
  - Find the highest-probability split in d according to senter.
  - Split d into d_1 and d_2 using the highest-probability split.
  - Recursively apply this algorithm to d_1 and d_2.
* Otherwise: do nothing
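As a rough illustration, here is a minimal sketch of the recursive step in Python. It assumes a list of per-token sentence-start probabilities (e.g. taken from the senter activations); the name `split_recursive` and its signature are made up for the example and are not the actual implementation.

```python
from typing import List


def split_recursive(start: int, end: int, probs: List[float], max_length: int) -> List[int]:
    """Return token offsets that partition [start, end) into pieces of at most
    max_length tokens, always splitting at the most probable sentence boundary.
    probs[i] is the (assumed) probability that token i starts a sentence."""
    if end - start <= max_length:
        return []
    # The first token of the piece is excluded as a candidate: splitting there
    # would leave an empty left piece and recurse forever (see the back-off below).
    best = max(range(start + 1, end), key=lambda i: probs[i])
    return (
        split_recursive(start, best, probs, max_length)
        + [best]
        + split_recursive(best, end, probs, max_length)
    )
```

For example, `sorted(split_recursive(0, len(doc), probs, 40))` gives boundaries such that every resulting piece `doc[start:end]` has at most 40 tokens, which keeps the per-piece cost of the biaffine attention bounded.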
We use a back-off when the first token is the best splitting point, to avoid infinite recursion. Previously the back-off was simply to use the second token; this refines it to choose the second-most probable splitting point instead.
We now use the spaCy parser scorer.
Now that Thinc no longer sets the Tensor type globally, we have to make sure that tensors are placed on the correct device.
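A tiny sketch of what this means in practice, assuming PyTorch tensors; the helper name is made up for illustration:

```python
import torch


def as_on_device(values, reference: torch.Tensor) -> torch.Tensor:
    # Create the new tensor on the same device as an existing model tensor,
    # instead of relying on a globally configured default tensor type.
    return torch.as_tensor(values, device=reference.device)
```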
Before this change, we'd use the senter pipe directly. However, this did not work with the transformer model without modifications (because it clears tensors after backprop). By using the functionality proposed in explosion/spaCy#11002 we can use the activations that are stored by the senter pipe in `Doc`.
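For illustration, usage could look roughly like this, assuming the `save_activations` mechanism proposed in explosion/spaCy#11002; the pipeline name and the exact attribute and key names are assumptions and may differ in the final API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed pipeline; must contain a senter pipe
nlp.enable_pipe("senter")            # senter is disabled by default in some pipelines
senter = nlp.get_pipe("senter")
senter.save_activations = True       # keep senter activations on the Doc after prediction

doc = nlp("This is a sentence. This is another one.")
activations = doc.activations.get(senter.name)  # stored activations (e.g. probabilities)
```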
This measure was removed.
Also make evaluation targets depend on the corpus they use.
Suggested by @kadarakos
Force-pushed from 13304a3 to 7e8b69c
Please don't review this PR; I am doing some force pushes here, mostly to get this building against spaCy v4. I'll close this PR and open up a new one when it's ready for review.
```python
for doc in docs:
    activations = doc.activations.get(senter.name, None)
    if activations is None:
        raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n"
```
raise ValueError("Greedy splitting requires `senter` with `save_activations` enabled.\n" | |
raise ValueError("Lazy splitting requires `senter` with `save_activations` enabled.\n" |
I'll fix that, but we really shouldn't use this as a review PR; it has already had force pushes and such, so I'll close it.
We use sentence splitting in the biaffine parser to keep the O(n^2) biaffine attention model tractable. However, since the sentence splitter makes errors, the parser may not have the correct head available.
This change adds another splitting strategy, which is used as the preferred splitting. The goal of this strategy is to split up a `Doc` into pieces that are as large as possible given a maximum n_max. This reduces the number of attachment errors as a result of incorrect sentence splits, while providing an upper bound on complexity (O(n_max^2)). The algorithm works as follows:

* If the length |d| > n_max:
  - Find the highest-probability split in d according to senter.
  - Split d into d_1 and d_2 using that split.
  - Recursively apply this algorithm to d_1 and d_2.
* Otherwise: do nothing
Note: draft, requires functionality from PR explosion/spaCy#11002, which targets spaCy v4.