Add SpanGroup and Graph container types to represent arbitrary annotations #6696

Merged
merged 42 commits into from
Jan 14, 2021

honnibal
Member

@honnibal honnibal commented Jan 8, 2021

Proposal

The Doc object has specific "slots" for the core annotations, which are heavily constrained for both efficiency and API simplicity. The only way to store arbitrary annotations has been to place data in doc.user_dict and then access it via the extension attribute system.

This PR provides native support for two additional container types, for more flexible information storage.

  • SpanGroup: A sequence of labelled spans.
  • Graph: A sequence of labelled, directed relations between sets of tokens. The nodes of the graph (the tokens) don't have to form contiguous spans. Nodes can also be empty, allowing arbitrary labels to be attached to the token groups.

Two new attributes are added to the Doc that components can use to store their annotations:

  • doc.spans (Dict[str, SpanGroup])
  • doc.graphs (Dict[str, Graph])

Pipeline components could then be configured with a string key under which to store their annotations. For instance, we expect to add a built-in coreference component. With its default configuration, it would write to doc.spans["coref"]. An alternative coreference component could be configured to write to the same key, or to a different one if you want to store annotations from both.
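
To make the configurable key concrete, here is a rough sketch in plain Python. It is a toy model rather than the spaCy API: ToyDoc, ToyCoref and the spans_key argument are hypothetical names, and a "span" is just a (start, end, label) tuple.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Toy stand-ins, not spaCy classes: a span is just (start, end, label).
ToySpan = Tuple[int, int, str]


@dataclass
class ToyDoc:
    words: List[str]
    spans: Dict[str, List[ToySpan]] = field(default_factory=dict)


class ToyCoref:
    """Hypothetical component that marks every pronoun as a mention."""

    PRONOUNS = {"he", "she", "it", "they"}

    def __init__(self, spans_key: str = "coref"):
        # The key under which this component stores its annotations.
        self.spans_key = spans_key

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        mentions = [
            (i, i + 1, "MENTION")
            for i, word in enumerate(doc.words)
            if word.lower() in self.PRONOUNS
        ]
        doc.spans[self.spans_key] = mentions
        return doc


doc = ToyDoc(["Ada", "said", "she", "wrote", "it"])
doc = ToyCoref()(doc)                       # default key: doc.spans["coref"]
doc = ToyCoref(spans_key="coref_alt")(doc)  # second component, separate key
```

With two different keys, both sets of annotations sit side by side on the same doc. Configuring both components with the same key would make the second one overwrite the first.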

SpanGroup

The new SpanGroup is a named list of Span objects. Arbitrary json-serializable attributes can also be attached to the SpanGroup. The Doc object is given a new dict attribute, doc.span_groups, whose keys are strings and whose values are SpanGroup objects.

Example use-cases

  • Storing the output of multiple NER passes
  • Storing coreference resolution data
  • Storing predicate-argument structures

Example usage

from spacy.tokens import Span, SpanGroup

spans = doc.create_span_group("coref_chain1")
# my_coref_chain yields (start, end, kb_id) triples.
for start, end, kb_id in my_coref_chain:
    spans.append(Span(doc, start=start, end=end, kb_id=kb_id))
spans.attrs["task"] = "coreference resolution"
# You can also access the dict directly.
doc.span_groups["coref_chain2"] = SpanGroup(doc, name="coref_chain2", attrs={"task": "coreference resolution"})

Implementation details

The main trick here is avoiding reference cycles. The Span object holds a reference to the Doc, so we don't want the Doc object to hold references to actual Span objects. Otherwise, reference counting won't be able to free the Doc (its count will never drop to zero), and we'll have to rely on garbage collection. Relying on garbage collection is bad: it means memory accumulates, it introduces pauses, and it makes destructors very difficult to reason about (because you don't know when the destructor will be called). It's especially problematic for managing GPU memory, because garbage collection is triggered by memory pressure, which doesn't take pressure on GPU resources into account.

To avoid the reference cycles, the SpanGroup object owns a weakref to the Doc, which doesn't increase the reference count, and stores the span data using a vector[SpanC]. This required a small refactor to the Span object to make it use the SpanC object to hold its internal data.
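
The idea can be sketched in pure Python. This is only a toy model under stated assumptions: ToyDoc, ToySpan, ToySpanGroup and SpanData are hypothetical stand-ins (the real implementation keeps a C++ vector[SpanC], not a Python list), and the immediate-free behaviour shown relies on CPython's reference counting.

```python
import weakref
from typing import List, NamedTuple


class SpanData(NamedTuple):
    """Plain offsets, standing in for the SpanC struct. Holds no Doc reference."""
    start: int
    end: int
    label: str


class ToyDoc:
    def __init__(self, words):
        self.words = list(words)
        self.spans = {}  # name -> ToySpanGroup


class ToySpan:
    def __init__(self, doc, start, end, label=""):
        self.doc = doc  # strong reference, like Span -> Doc
        self.start, self.end, self.label = start, end, label

    @property
    def text(self):
        return " ".join(self.doc.words[self.start:self.end])


class ToySpanGroup:
    def __init__(self, doc, name):
        self._doc_ref = weakref.ref(doc)  # does not bump the refcount
        self.name = name
        self._data: List[SpanData] = []

    def append(self, span):
        # Store only the offsets, never the span object itself.
        self._data.append(SpanData(span.start, span.end, span.label))

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        doc = self._doc_ref()
        if doc is None:
            raise RuntimeError("the Doc this group belonged to is gone")
        d = self._data[i]
        return ToySpan(doc, d.start, d.end, d.label)  # built on demand


def demo():
    doc = ToyDoc(["Ada", "Lovelace", "wrote", "it"])
    group = ToySpanGroup(doc, "coref")
    group.append(ToySpan(doc, 0, 2, "PERSON"))
    doc.spans["coref"] = group  # Doc -> group is strong, group -> Doc is weak
    text = group[0].text
    probe = weakref.ref(doc)
    del doc  # no cycle, so CPython frees the doc right here
    return text, probe() is None
```

Because the group only stores offsets and a weak reference, span = span_group[i] still works, but nothing the Doc owns points back at it strongly.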

An alternative to the weakref would be to require the Doc to be passed in explicitly when fetching data back out of the SpanGroup. This would stop us from having the span group work like a list; we couldn't have span = span_group[i]. We would need to have something like span = span_group.get(doc, i), which isn't very nice imo.

Decisions to debate

  1. Should it be doc.spans or doc.span_groups?

Just from the name, I'd guess doc.spans would be a list, not a dict. On the other hand that will only surprise you once, and after that you can stop writing doc.span_groups, which feels inconsistent with the preference for brevity elsewhere in the API.

I suppose we could make doc.spans a property that iterates over all the spans in the doc.span_groups. Seems somewhat pointless though?

  2. Should it be a list?

We could make it just a list, but I think it's nice to fetch the span group by name.

  3. Should we make the SpanGroups data structure manage the subgroups?

We could have some more complex container that has an API for giving out all of the span groups and then named subgroups of them as well. So it would be like doc.spans.group("coref") or something. I think this will send people to the docs all the time to remember the API specifics? A dict is only a little bit less concise, and it feels so much simpler.

There are some arguments for making an object to manage the whole dict though. One is that the whole dict needs to be serialized together; we currently use some dict comprehensions for this. Another is that we could have one weakref owned by the whole container, rather than giving each span group its own reference. We also have a little bit of duplication in the current data structure: the names are stored twice, once as keys and again as an attribute of the SpanGroup. This information can get out-of-sync, because we have it twice.

  4. The doc.create_span_group is ugly

You can't create a SpanGroup without passing in the Doc object, so I added this helper to add a new span group. It feels weird though.

  5. We should have a span_groups field in the Doc.__init__, right?

It's a bit annoying because we'll have to pass in the span data in dict format. What we mustn't do is give in to the temptation to do the evil (deprecated) tuple format for a span, like (label, start, end).

Graph

This is much more drafty, but I've done some initial work on it and I think it seems promising. Directed labelled graphs can represent pretty much anything (although you might need to transform the graph before you can query it in any practical way). spaCy's philosophy has previously been that directed graphs are too under-constrained, which doesn't let us build good APIs for linguistic annotations. Most linguistic annotations aren't arbitrary: they have specific structure, so we can build much tighter interfaces for them. For instance, the tree constraint works really well for about 98% of syntax, and letting a word have multiple heads makes the structure really difficult to use.

Still, sometimes you do want a graph, e.g. for predicate-argument structures. So we should have this type in the library. This is especially true because the Python ecosystem doesn't actually have many options for this. The networkx package is the most popular, but it's a pretty big library and I have doubts that a pure-Python implementation is well suited to our use-case. graph-tool looks much more appealing to me, but it also has a pretty wide scope, and I doubt there are plans to package it for pip (it seems really hard). I think we should have our own graph type and let people export to these other packages for stuff like visualisation, instead of having them as a dependency.
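
For a feel of what such a container holds, here is a minimal pure-Python sketch of labelled, directed edges between sets of tokens. It is an assumption-laden toy, not the Graph API from this PR: ToyGraph, add_node, add_edge and tails are hypothetical names, and nodes are simply tuples of token indices.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Node = Tuple[int, ...]  # token indices; may be non-contiguous or empty


class ToyGraph:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self._node_ids: Dict[Node, int] = {}
        self.edges: List[Tuple[int, int, str]] = []  # (head_id, tail_id, label)

    def add_node(self, token_indices: Iterable[int]) -> int:
        """Return the id for this token set, creating the node if needed."""
        node = tuple(token_indices)
        if node not in self._node_ids:
            self._node_ids[node] = len(self.nodes)
            self.nodes.append(node)
        return self._node_ids[node]

    def add_edge(self, head: Iterable[int], tail: Iterable[int], label: str) -> None:
        self.edges.append((self.add_node(head), self.add_node(tail), label))

    def tails(self, head: Iterable[int], label: Optional[str] = None) -> List[Node]:
        """Nodes reachable from `head` in one step, optionally filtered by label."""
        head_id = self._node_ids.get(tuple(head))
        return [
            self.nodes[t]
            for h, t, lab in self.edges
            if h == head_id and (label is None or lab == label)
        ]


words = ["She", "looked", "the", "word", "up"]
graph = ToyGraph()
predicate = (1, 4)  # "looked ... up": not a contiguous span
graph.add_edge(predicate, (0,), "ARG0")
graph.add_edge(predicate, (2, 3), "ARG1")
graph.add_edge(predicate, (), "phrasal_verb")  # empty node: a label on the group
predicate_text = " ".join(words[i] for i in predicate)
```

The edge to the empty node shows how a bare label can be attached to a token group without inventing a second group of tokens.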

TODO

  • Implement
  • Implement serialization
  • Test span groups are preserved in Doc byte serialization
  • Test span groups are preserved in DocBin byte serialization
  • Finalize naming and usage
  • Write docs

Future work

  • Use SpanGroup for beam results?
  • Plan for coref?
  • Support SpanGroup in Matcher?
  • Document Graph and refine its API

@honnibal honnibal added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels Jan 8, 2021
@honnibal honnibal changed the title WIP: AllowDoc object to track named span groups WIP: Allow Doc object to track named span groups Jan 9, 2021
@honnibal honnibal changed the title WIP: Allow Doc object to track named span groups WIP: Add SpanGroup and Graph container types to represent arbitrary annotations Jan 11, 2021
@svlandeg svlandeg added the feat / coref Feature: Coreference resolution label Jan 11, 2021
@svlandeg
Member

My two cents:

Just from the name, I'd guess doc.spans would be a list, not a dict. On the other hand that will only surprise you once, and after that you can stop writing doc.span_groups, which feels inconsistent with the preference for brevity elsewhere in the API.

Agreed with the trade-off here, I think doc.spans will just be so much cleaner in everyone's code. And you remember it much more easily than doc.span_groups.

I suppose we could make doc.spans a property that iterates out all the spans in the doc.span_groups. Seems somewhat pointless though?

If we do call the dictionary doc.spans, we could have a function or a property doc.list_spans that lists all spans? But then each Span should probably have a reference to the original group it belonged to? I can imagine some use-cases preferring the "list view", but I don't think it's really necessary. We can probably just do without.

We could make it just a list, but I think it's nice to fetch the span group by name.

Should we make the SpanGroups data structure manage the subgroups?

I would keep the dictionary structure, I quite like it as such, and the API is more straightforward, as you argued.

We also have a little bit of duplication in the current data structure: the names are stored twice, once as keys and again as an attribute of the SpanGroup.

I wondered about this redundancy too. It's a little bit like the pipeline component names, that store their own name but then the Language object stores them by name as well. I worried about that before too, but I guess it usually doesn't happen that they go out-of-sync. At least I've never seen any major problems with that.
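
For what it's worth, the way the two copies of the name could drift apart is easy to show with a plain dict. This is a toy illustration only; ToyGroup and out_of_sync are hypothetical names, not part of the PR.

```python
class ToyGroup:
    """Stand-in for a span group that stores its own name."""

    def __init__(self, name, spans=()):
        self.name = name
        self.spans = list(spans)


def out_of_sync(groups):
    """Return the keys whose group reports a different name."""
    return [key for key, group in groups.items() if group.name != key]


groups = {"coref": ToyGroup("coref", [(0, 1)])}
in_sync_before = out_of_sync(groups) == []

# Re-keying the dict does not touch the name stored on the group.
groups["coref_v2"] = groups.pop("coref")
stale = out_of_sync(groups)
```

A check like out_of_sync could run at serialization time if the mismatch ever turned out to matter in practice.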

The doc.create_span_group is ugly

Alternatively, this function could be called add_span_group or add_spans (if we rename to doc.spans) and we could optionally give it a list of spans? So we'd have

span_group = doc.add_spans("hi", [Span(doc, 3, 4, label="bye")])

instead of

span_group = doc.create_span_group("hi")
span_group.append(Span(doc, 3, 4, label="bye"))

We should have a span_groups field in the Doc.__init__, right? It's a bit annoying because we'll have to pass in the span data in dict format.

What we mustn't do is give in to the temptation to do the evil (deprecated) tuple format for a span, like (label, start, end).

Hmm yes, we really should avoid the tuple format. But then how to do this? We can't make a Span before we created the Doc?

@honnibal
Member Author

After some discussion, we're currently at the following answers:

Should it be doc.spans or doc.span_groups?

doc.spans

Should it be a list?

Nah.

Should we make the SpanGroups data structure manage the subgroups?

Make a very thin SpanGroups container that works just like a dict, but handles the serialization and allows assignment of a list of spans.
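
Roughly, "works just like a dict, but handles the serialization and allows assignment of a list of spans" could look like the sketch below. It is plain Python under stated assumptions: ToyDoc, ToyGroup and ToySpanGroups are hypothetical stand-ins, spans are (start, end, label) tuples, and JSON is used only as a placeholder format.

```python
import json
import weakref
from collections import UserDict


class ToyGroup:
    def __init__(self, doc, name, spans=()):
        self._doc_ref = weakref.ref(doc)
        self.name = name
        self.spans = [tuple(s) for s in spans]  # (start, end, label)


class ToySpanGroups(UserDict):
    """Dict-like container: one weakref to the doc, shared by the container."""

    def __init__(self, doc):
        super().__init__()
        self._doc_ref = weakref.ref(doc)

    def __setitem__(self, key, value):
        # Assigning a plain list of spans wraps it in a group automatically.
        if not isinstance(value, ToyGroup):
            doc = self._doc_ref()
            if doc is None:
                raise RuntimeError("the Doc this container belonged to is gone")
            value = ToyGroup(doc, key, value)
        self.data[key] = value

    def to_bytes(self):
        payload = {key: group.spans for key, group in self.data.items()}
        return json.dumps(payload).encode("utf8")

    def from_bytes(self, data):
        self.data.clear()
        for key, spans in json.loads(data.decode("utf8")).items():
            self[key] = spans  # goes through __setitem__, rebuilding groups
        return self


class ToyDoc:
    def __init__(self, words):
        self.words = list(words)
        self.spans = ToySpanGroups(self)


doc = ToyDoc(["Ada", "wrote", "it"])
doc.spans["ner"] = [(0, 1, "PERSON")]  # list in, group out
restored = ToyDoc(doc.words)
restored.spans.from_bytes(doc.spans.to_bytes())
```

Everything else (iteration, len, membership tests, deletion) comes from the dict behaviour for free, which is the appeal of keeping the container this thin.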

The doc.create_span_group is ugly

Kill it.

We should have a span_groups field in the Doc.__init__, right?

Ugh, maybe? Not currently implemented though.

@honnibal honnibal changed the title WIP: Add SpanGroup and Graph container types to represent arbitrary annotations Add SpanGroup and Graph container types to represent arbitrary annotations Jan 14, 2021
@honnibal honnibal merged commit f277bfd into develop Jan 14, 2021
@ines ines deleted the feature/spans-on-doc branch January 30, 2021 08:54
This was referenced Mar 1, 2021
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 11, 2021
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following explosion#6696.
@adrianeboyd adrianeboyd mentioned this pull request May 11, 2021
3 tasks
honnibal pushed a commit that referenced this pull request May 17, 2021
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following #6696.
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 19, 2021
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following explosion#6696.
svlandeg pushed a commit to svlandeg/spaCy that referenced this pull request May 26, 2021
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following explosion#6696.
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 31, 2021
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following explosion#6696.