
Native coref component #7264

Closed
wants to merge 215 commits into from

Conversation

svlandeg
Member

@svlandeg svlandeg commented Mar 3, 2021

Work-in-progress

Description

Creating a native coref component in spaCy

Types of change

new feature

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

* initial coref_er pipe

* matcher more flexible

* base coref component without actual model

* initial setup of coref_er.score

* rename to include_label

* preliminary score_clusters method

* apply scoring in coref component

* IO fix

* return None loss for now

* rename to CoreferenceResolver

* some preliminary unit tests

* use registry as callable
@svlandeg
Member Author

svlandeg commented Mar 3, 2021

Status March 1:

  • Wrote preliminary v3-compatible framework to facilitate experimentation with different coref models
  • Currently assuming two different pipeline components:
    • coref_er / CorefEntityRecognizer is a rule-based mention detection algorithm: uses noun chunks, POS tags and named entities
    • coref / CoreferenceResolver assembles the provided mentions into clusters (dummy implementation)
  • Using doc.spans to store the information:
    • doc.spans["coref_mentions"] for storing all relevant coref mentions (nouns, pronouns, names, ...)
    • doc.spans["coref_clusters_i"] for the different clusters, indexed by i
  • Coref.v0 needs to be implemented and changed to Coref.v1
  • The Scorer.score_clusters method currently uses an overly simple scoring mechanism (binary relations between mentions); it should be refined with an actual coref scoring algorithm
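
As a rough illustration of the doc.spans layout described above (a hypothetical sketch with made-up mentions and cluster assignments; the exact key names were still under discussion at this stage):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Sarah told her sister that she had won.")

# All candidate coref mentions go into one shared span group
doc.spans["coref_mentions"] = [doc[0:1], doc[2:3], doc[2:4], doc[5:6]]

# One span group per cluster, distinguished only by the key suffix
doc.spans["coref_clusters_1"] = [doc[0:1], doc[2:3]]  # Sarah ... her
doc.spans["coref_clusters_2"] = [doc[2:4], doc[5:6]]  # her sister ... she

for key in sorted(doc.spans):
    if key.startswith("coref_clusters"):
        print(key, [span.text for span in doc.spans[key]])
```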

While all of this is mostly a dummy framework, it has already helped discover some bugs & required functionality, cf. PRs #7197, #7209 and #7225.

Going forward, having this bare framework should facilitate working on this functionality with different people in parallel, filling in different parts...

TODO

  • Implement proper coref ML model
  • Proper mention detection algorithm, rule-based, ML-based, something like the SpanCategorizer, ...
  • Meaningful evaluation script
  • Tune & benchmark
  • Rewrite errors to use spacy.errors

Open questions / current issues

  • While we talked about keeping doc.spans a relatively simple dictionary of strings mapping to lists of spans, we might consider a more formal way of defining clusters that belong together. Currently this is done by matching a prefix in the spans key, which is obviously not ideal
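
For concreteness, the prefix-matching convention in question could be sketched like this (clusters_from_spans is a made-up helper for illustration, not part of spaCy):

```python
def clusters_from_spans(span_groups: dict, prefix: str = "coref_clusters") -> list:
    """Collect all span groups whose key starts with the cluster prefix.

    This mimics the informal convention described above: cluster
    membership is encoded only in the dictionary key, e.g.
    "coref_clusters_1", "coref_clusters_2", ...
    """
    return [
        group
        for key, group in sorted(span_groups.items())
        if key.startswith(prefix)
    ]

# Toy stand-in for doc.spans: keys map to lists of (start, end) offsets
spans = {
    "coref_mentions": [(0, 1), (2, 3), (5, 6)],
    "coref_clusters_1": [(0, 1), (2, 3)],
    "coref_clusters_2": [(5, 6)],
}
print(clusters_from_spans(spans))  # [[(0, 1), (2, 3)], [(5, 6)]]
```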

  • The design with the rule-based coref_er is again awkward: this component won't run during nlp.update, meaning the coref model could only train on gold mentions, which is not a good idea for the generalizability and robustness of the ML model.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / coref Feature: Coreference resolution ⚠️ wip Work in progress labels Mar 3, 2021
@svlandeg svlandeg changed the title Native coref component (#7243) Native coref component Mar 3, 2021
@LifeIsStrange

Just saying that I hope the state of the art
will be available eventually.

Anyway, this is a very welcome improvement that I'm looking forward to :)

polm and others added 23 commits May 15, 2021 20:05
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.

In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
Ended up not making a difference, but oh well.
When sentences are not available, just treat the whole doc as one
sentence. This is a reasonable general fallback, and is particularly
important for the init call, where upstream components aren't run.
Training seems to actually run now!
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
The loss was being returned as a single element array, which caused
training to die when it attempted to turn it into JSON.
This is closer to the traditional evaluation method. That uses an
average of three scores; this just uses the bcubed metric for now
(nothing special about bcubed, just picked one).

The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.

Besides being comparable with traditional evaluations, this scoring is
relatively fast.
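
For readers unfamiliar with it, the bcubed metric mentioned above can be sketched roughly as follows (a simplified stand-alone illustration, not the coval implementation the PR actually uses):

```python
def b_cubed(gold_clusters, pred_clusters):
    """Simplified B-cubed precision/recall/F1 for mention clusterings.

    Each clustering is a list of sets of hashable mention IDs; for every
    mention we measure the overlap between its gold and predicted cluster.
    """
    gold_of = {m: c for c in map(frozenset, gold_clusters) for m in c}
    pred_of = {m: c for c in map(frozenset, pred_clusters) for m in c}

    def avg_overlap(mention_map, other_map):
        # Per-mention overlap ratio, averaged over all mentions
        scores = [
            len(cluster & other_map.get(m, frozenset())) / len(cluster)
            for m, cluster in mention_map.items()
        ]
        return sum(scores) / len(scores) if scores else 0.0

    p = avg_overlap(pred_of, gold_of)
    r = avg_overlap(gold_of, pred_of)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Identical clusterings score perfectly
print(b_cubed([{"a", "b"}, {"c"}], [{"a", "b"}, {"c"}]))  # (1.0, 1.0, 1.0)
```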
The intent of this was that it would be a pipeline component that used
entities as input, but that's now covered by the get_mentions function
as a pipeline arg.
polm added 5 commits July 12, 2022 12:56
There's no guarantee about the order in which SpanGroup keys will come
out, so access them in sorted order when doing comparisons.
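
The sorted-access pattern being described might look something like this in test code (cluster_texts is an illustrative helper, not spaCy API):

```python
import spacy

def cluster_texts(doc, prefix="coref_clusters"):
    """Cluster span texts in a deterministic order: sort the span group
    keys before iterating, since their order is otherwise unspecified."""
    return [
        [span.text for span in doc.spans[key]]
        for key in sorted(doc.spans)
        if key.startswith(prefix)
    ]

nlp = spacy.blank("en")
doc = nlp("Alice said she left")
doc.spans["coref_clusters_2"] = [doc[3:4]]  # added out of order on purpose
doc.spans["coref_clusters_1"] = [doc[0:1], doc[2:3]]
print(cluster_texts(doc))  # [['Alice', 'she'], ['left']]
```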
This was only needed while the tok2vec_size option existed.
This was probably used in the prototyping stage, left as a reference,
and then forgotten. Nothing uses it any more.
@polm
Contributor

polm commented Jul 12, 2022

@explosion-bot please test_gpu

@explosion explosion unlocked this conversation Jul 12, 2022
@polm
Contributor

polm commented Jul 12, 2022

@explosion-bot please test_gpu

@explosion-bot
Collaborator

explosion-bot commented Jul 12, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/100

@explosion explosion locked and limited conversation to collaborators Jul 12, 2022
@svlandeg
Member Author

Closing this PR, as we'll release the functionality in spacy-experimental first: explosion/spacy-experimental#17

The docs PR is here: #11291

@svlandeg svlandeg closed this Aug 11, 2022
@svlandeg
Member Author

svlandeg commented Oct 6, 2022

Just wanted to send a quick update about coref support in spaCy:

We'd love for you to try this out, and any feedback is very welcome over at the discussion forum!
