Add SpanGroup and Graph container types to represent arbitrary annotations #6696

honnibal · 2021-01-08T07:13:19Z

Proposal

The Doc object has specific "slots" for the core annotations, which are heavily constrained for both efficiency and API simplicity. The only way to store arbitrary annotations has been to place data in doc.user_dict and then access it via the extension attribute system.

This PR provides native support for two additional container types, for more flexible types of information storage.

SpanGroup: A sequence of labelled spans.
Graph: A sequence of labelled, directed relations between sets of tokens. The nodes of the graph (the tokens) don't have to form contiguous spans. Nodes can also be empty, allowing arbitrary labels to be attached to the token groups.

Two new attributes are added to the Doc that components can use to store their annotations:

doc.spans (Dict[str, SpanGroup])
doc.graphs (Dict[str, Graph])

Pipeline components could then be configured with a string key under which to store their annotations. For instance, we expect to add a built-in coreference component. With its default configuration, it would write to doc.spans["coref"]. An alternative coreference component could be configured to write to the same key, or to a different one if you want to store annotations from both.
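As a rough illustration of the configurable-key idea, here is a sketch using plain Python stand-ins rather than spaCy itself. ToyDoc, ToySpan, ToyCoref and the spans_key parameter are hypothetical names invented for this example; only the idea of writing to doc.spans[key] comes from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a labelled span (token offsets, end exclusive)."""
    start: int
    end: int
    label: str = ""


@dataclass
class ToyDoc:
    """Hypothetical stand-in for Doc: just the words and the proposed spans dict."""
    words: List[str]
    spans: Dict[str, List[ToySpan]] = field(default_factory=dict)


class ToyCoref:
    """Toy component that marks every pronoun; the storage key is configurable."""

    PRONOUNS = {"she", "her", "he", "him", "it"}

    def __init__(self, spans_key: str = "coref"):
        self.spans_key = spans_key

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        found = [
            ToySpan(i, i + 1, "MENTION")
            for i, word in enumerate(doc.words)
            if word.lower() in self.PRONOUNS
        ]
        doc.spans[self.spans_key] = found
        return doc


doc = ToyDoc(["Ada", "said", "she", "wrote", "it"])
doc = ToyCoref()(doc)                       # default key: "coref"
doc = ToyCoref(spans_key="coref_alt")(doc)  # second component, separate key
print(sorted(doc.spans))
```

Configuring both components with the same key would make the second overwrite the first, which is the trade-off the paragraph above describes.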

SpanGroup

The new SpanGroup is a named list of Span objects. Arbitrary JSON-serializable attributes can also be attached to the SpanGroup. The Doc object is given a new dict attribute, doc.span_groups, whose keys are strings and whose values are SpanGroup objects.

Example use-cases

Storing the output of multiple NER passes
Storing coreference resolution data
Storing predicate-argument structures

Example usage

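from spacy.tokens import Span, SpanGroup

# `doc` is an existing Doc; my_coref_chain yields (start, end, kb_id) triples.
spans = doc.create_span_group("coref_chain1")
for start, end, kb_id in my_coref_chain:
    spans.append(Span(doc, start=start, end=end, kb_id=kb_id))
# Attach arbitrary JSON-serializable attributes to the group.
spans.attrs["task"] = "coreference resolution"
# You can also access the dict directly.
doc.span_groups["coref_chain2"] = SpanGroup(doc, name="coref_chain2", attrs={"task": "coreference resolution"})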

Implementation details

The main trick here is avoiding reference cycles. The Span object holds a reference to the Doc, so we don't want the Doc object to hold references to actual Span objects. Otherwise, reference counting won't be able to free the Doc (its count will never drop to zero), and we'll have to rely on garbage collection. Relying on garbage collection is bad: it means memory accumulates, it introduces pauses, and it makes destructors very difficult to reason about (because you don't know when the destructor will be called). It's especially problematic for managing GPU memory, because garbage collection is triggered by memory pressure, which doesn't account for pressure on GPU resources.

To avoid the reference cycles, the SpanGroup object owns a weakref to the Doc, which doesn't increase the reference count, and stores the span data using a vector[SpanC]. This required a small refactor to the Span object to make it use the SpanC object to hold its internal data.
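To make the weakref arrangement concrete, here is a minimal pure-Python sketch. It is not the Cython implementation: ToyDoc and ToySpanGroup are hypothetical stand-ins, a list of (start, end) offsets plays the role of the vector[SpanC], and the demonstration assumes CPython's reference counting.

```python
import gc
import weakref


class ToyDoc:
    """Hypothetical stand-in for Doc; it owns its span groups."""

    def __init__(self, words):
        self.words = list(words)
        self.spans = {}


class ToySpanGroup:
    """Stores plain offsets plus a weak reference back to the doc."""

    def __init__(self, doc, name):
        self._doc_ref = weakref.ref(doc)  # does not increase the doc's refcount
        self.name = name
        self._offsets = []                # plain data, no span objects

    @property
    def doc(self):
        doc = self._doc_ref()
        if doc is None:
            raise RuntimeError("the Doc has already been freed")
        return doc

    def append(self, start, end):
        self._offsets.append((start, end))

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # Span-like views are built on demand, so the group keeps none alive.
        start, end = self._offsets[i]
        return self.doc.words[start:end]


gc.disable()  # ensure only reference counting can free the doc
try:
    doc = ToyDoc(["Ada", "said", "she", "wrote", "it"])
    group = ToySpanGroup(doc, "coref")
    doc.spans["coref"] = group  # doc -> group is strong, group -> doc is weak
    group.append(0, 1)
    first = group[0]
    probe = weakref.ref(doc)
    del doc
    freed_by_refcount = probe() is None
finally:
    gc.enable()
print(first, freed_by_refcount)
```

If the group held a strong reference to the doc instead, the doc and its group would form a cycle and the final check would only pass after a garbage collection run.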

An alternative to the weakref would be to require the Doc to be passed in explicitly when fetching data back out of the SpanGroup. This would stop us from having the span group work like a list; we couldn't have span = span_group[i]. We would need to have something like span = span_group.get(doc, i), which isn't very nice imo.
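For comparison, the explicit-Doc alternative might look roughly like the sketch below. ExplicitSpanGroup is a hypothetical name, and a plain list of words stands in for the Doc.

```python
class ExplicitSpanGroup:
    """Hypothetical variant that holds no reference to the doc at all."""

    def __init__(self, name):
        self.name = name
        self._offsets = []

    def append(self, start, end):
        self._offsets.append((start, end))

    def get(self, doc_words, i):
        # The caller has to supply the doc on every lookup.
        start, end = self._offsets[i]
        return doc_words[start:end]


words = ["Ada", "said", "she", "wrote", "it"]
group = ExplicitSpanGroup("coref")
group.append(2, 3)
span = group.get(words, 0)  # works, but group[0] is not available
print(span)
```

There is no cycle to worry about here, but the group can no longer behave like a list, which is the cost weighed above.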

Decisions to debate

Should it be doc.spans or doc.span_groups?

Just from the name, I'd guess doc.spans would be a list, not a dict. On the other hand that will only surprise you once, and after that you can stop writing doc.span_groups, which feels inconsistent with the preference for brevity elsewhere in the API.

I suppose we could make doc.spans a property that iterates out all the spans in the doc.span_groups. Seems somewhat pointless though?

Should it be a list?

We could make it just a list, but I think it's nice to fetch the span group by name.

Should we make the SpanGroups data structure manage the subgroups?

We could have some more complex container that has an API for giving out all of the span groups and then named subgroups of them as well. So it would be like doc.spans.group("coref") or something. I think this will send people to the docs all the time to remember the API specifics? A dict is only a little bit less concise, and it feels so much simpler.

There are some arguments for making an object to manage the whole dict though. One is that the whole dict needs to be serialized together; we currently use some dict comprehensions for this. Another is that we could have one weakref owned by the whole container, rather than giving each span group its own reference. We also have a little bit of duplication in the current data structure: the names are stored twice, once as keys and again as an attribute of the SpanGroup. This information can get out-of-sync, because we have it twice.

The doc.create_span_group is ugly

You can't create a SpanGroup without passing in the Doc object, so I added this helper to add a new span group. It feels weird though.

We should have a span_groups field in the Doc.__init__, right?

It's a bit annoying because we'll have to pass in the span data in dict format. What we mustn't do is give in to the temptation to do the evil (deprecated) tuple format for a span, like (label, start, end).

Graph

This is much more drafty, but I've done some initial work on it and I think it seems promising. Directed labelled graphs can represent pretty much anything (although you might need to transform the graph before you can query it in any practical way). spaCy's philosophy has previously been that directed graphs are too under-constrained to let us build good APIs for linguistic annotations. Most linguistic annotations aren't arbitrary; they have specific structure, so we can build much tighter interfaces for them. For instance, the tree constraint works really well for about 98% of syntax, and letting a word have multiple heads makes the structure really difficult to use.

Still, sometimes you do want a graph, e.g. for predicate-argument structures. So we should have this type in the library. This is especially true because the Python ecosystem doesn't actually have many options for this. The networkx package is the most popular, but it's a pretty big library and I have doubts that a pure-Python implementation is well suited to our use-case. graph-tool looks much more appealing to me, but it also has a pretty wide scope, and I doubt there are plans to package it for pip (it seems really hard). I think we should have our own graph type and let people export to these other packages for stuff like visualisation, instead of having them as a dependency.
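To give a feel for the kind of structure meant here, the following is a small pure-Python sketch of a labelled directed graph whose nodes are groups of token indices. It is not the Graph type added in this PR; ToyGraph and its methods are hypothetical names. The to_edge_list output uses (u, v, attrs) triples, a shape that, for example, networkx's add_edges_from accepts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Node = Tuple[int, ...]  # token indices; may be non-contiguous or empty


@dataclass
class ToyGraph:
    """Hypothetical minimal labelled digraph over groups of token indices."""
    nodes: List[Node] = field(default_factory=list)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)  # (head, tail, label)
    _index: Dict[Node, int] = field(default_factory=dict)

    def add_node(self, tokens) -> int:
        """Return the node's index, adding the node if it is new."""
        node = tuple(tokens)
        if node not in self._index:
            self._index[node] = len(self.nodes)
            self.nodes.append(node)
        return self._index[node]

    def add_edge(self, head, tail, label: str) -> None:
        self.edges.append((self.add_node(head), self.add_node(tail), label))

    def tails(self, head) -> List[Node]:
        """Nodes reachable from `head` by one outgoing edge."""
        h = self._index[tuple(head)]
        return [self.nodes[t] for (hd, t, _) in self.edges if hd == h]

    def to_edge_list(self) -> List[Tuple[Node, Node, dict]]:
        """Plain (u, v, attrs) triples for handing to another graph library."""
        return [(self.nodes[h], self.nodes[t], {"label": lab}) for h, t, lab in self.edges]


# Tokens: She(0) gave(1) the(2) idea(3) up(4)
# The predicate "gave ... up" is the non-contiguous node (1, 4).
graph = ToyGraph()
graph.add_edge((1, 4), (0,), "ARG0")
graph.add_edge((1, 4), (2, 3), "ARG1")
print(graph.tails((1, 4)))
print(graph.to_edge_list())
```

Keeping export as a plain edge list means visualisation libraries can stay optional rather than becoming dependencies.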

TODO

Implement
Implement serialization
Test span groups are preserved in Doc byte serialization
Test span groups are preserved in DocBin byte serialization
Finalize naming and usage
Write docs

Future work

Use SpanGroup for beam results?
Plan for coref?
Support SpanGroup in Matcher?
Document Graph and refine its API

svlandeg · 2021-01-11T12:49:50Z

My two cents:

Just from the name, I'd guess doc.spans would be a list, not a dict. On the other hand that will only surprise you once, and after that you can stop writing doc.span_groups, which feels inconsistent with the preference for brevity elsewhere in the API.

Agreed with the trade-off here, I think doc.spans will just be so much cleaner in everyone's code. And you remember it much more easily than doc.span_groups.

I suppose we could make doc.spans a property that iterates out all the spans in the doc.span_groups. Seems somewhat pointless though?

If we do call the dictionary doc.spans, we could have a function or a property doc.list_spans that lists all spans? But then each Span should probably have a reference to the original group it belonged to? I can imagine some use-cases preferring the "list view", but I don't think it's really necessary. We can probably just do without.
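A "list view" along these lines could be a small helper over the dict. This is only a sketch with hypothetical names (ToySpan, list_spans); pairing each span with its key is one way to keep the group information without adding a back-reference to the span.

```python
from typing import Dict, Iterator, List, NamedTuple


class ToySpan(NamedTuple):
    """Hypothetical stand-in for a span (token offsets, end exclusive)."""
    start: int
    end: int
    label: str


def list_spans(groups: Dict[str, List[ToySpan]]) -> Iterator[tuple]:
    """Yield (group_name, span) pairs ordered by position in the document."""
    pairs = ((name, span) for name, spans in groups.items() for span in spans)
    yield from sorted(pairs, key=lambda pair: (pair[1].start, pair[1].end, pair[0]))


groups = {
    "coref": [ToySpan(2, 3, "MENTION"), ToySpan(0, 1, "MENTION")],
    "ner": [ToySpan(0, 1, "PERSON")],
}
flat = list(list_spans(groups))
for name, span in flat:
    print(name, span)
```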

We could make it just a list, but I think it's nice to fetch the span group by name.

Should we make the SpanGroups data structure manage the subgroups?

I would keep the dictionary structure, I quite like it as such, and the API is more straightforward, as you argued.

We also have a little bit of duplication in the current data structure: the names are stored twice, once as keys and again as an attribute of the SpanGroup.

I wondered about this redundancy too. It's a little bit like the pipeline components, which store their own name while the Language object also stores them by name. I worried about that before too, but I guess it rarely happens that they go out-of-sync. At least I've never seen any major problems with that.

The doc.create_span_group is ugly

Alternatively, this function could be called add_span_group or add_spans (if we rename to doc.spans) and we could optionally give it a list of spans? So we'd have

span_group = doc.add_spans("hi", [Span(doc, 3, 4, label="bye")])

instead of

span_group = doc.create_span_group("hi")
span_group.append(Span(doc, 3, 4, label="bye"))

We should have a span_groups field in the Doc.__init__, right? It's a bit annoying because we'll have to pass in the span data in dict format.

What we mustn't do is give in to the temptation to do the evil (deprecated) tuple format for a span, like (label, start, end).

Hmm yes, we really should avoid the tuple format. But then how to do this? We can't make a Span before we created the Doc?

honnibal · 2021-01-12T04:53:02Z

After some discussion, we're currently at the following answers:

Should it be doc.spans or doc.span_groups?

doc.spans

Should it be a list?

Nah.

Should we make the SpanGroups data structure manage the subgroups?

Make a very thin SpanGroups container that works just like a dict, but handles the serialization and allows assignment of a list of spans.

The doc.create_span_group is ugly

Kill it.

We should have a span_groups field in the Doc.__init__, right?

Ugh, maybe? Not currently implemented though.
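The "very thin container" answer above could be sketched in pure Python as follows. This is a toy model, not the merged implementation: ToySpanGroup and ToySpanGroups are hypothetical names, spans are plain dict records, and JSON stands in for whatever byte format the real serialization uses.

```python
import json
from collections import UserDict


class ToySpanGroup:
    """Hypothetical stand-in: a named list of span records (plain dicts)."""

    def __init__(self, name, spans=()):
        self.name = name
        self.spans = [dict(span) for span in spans]


class ToySpanGroups(UserDict):
    """Dict-like container that owns naming and serialization of its groups."""

    def __setitem__(self, key, value):
        if not isinstance(value, ToySpanGroup):
            value = ToySpanGroup(key, value)  # allow assigning a plain list
        value.name = key                      # key and group name cannot drift apart
        super().__setitem__(key, value)

    def to_json(self) -> str:
        return json.dumps({key: group.spans for key, group in self.data.items()})

    @classmethod
    def from_json(cls, payload: str) -> "ToySpanGroups":
        groups = cls()
        for key, spans in json.loads(payload).items():
            groups[key] = spans
        return groups


groups = ToySpanGroups()
groups["coref"] = [{"start": 2, "end": 3, "label": "MENTION"}]
groups["renamed"] = ToySpanGroup("stale_name")
restored = ToySpanGroups.from_json(groups.to_json())
print(sorted(restored), groups["renamed"].name)
```

Overwriting the group's name on assignment is one possible way to address the duplicated-name concern raised earlier; it is a design choice of this sketch, not something the thread settled on.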

honnibal added 6 commits July 2, 2020 14:02

Draft out initial Spans data structure

863215f

Merge changes to Span

da4cee1

Initial span group commit

efd6f69

Basic span group support on Doc

7d3007f

Basic test for span group

5dc08e8

Compile span_group.pyx

d6591d7

honnibal added the enhancement (Feature requests and improvements) and feat / doc (Feature: Doc, Span and Token objects) labels Jan 8, 2021

honnibal changed the title ~~WIP: AllowDoc object to track named span groups~~ WIP: Allow Doc object to track named span groups Jan 9, 2021

honnibal added 8 commits January 10, 2021 00:40

Merge branch 'develop' into feature/spans-on-doc

56b6cbd

Draft addition of SpanGroup to DocBin

6a06039

Add deserialization for SpanGroup

fd87a9e

Add tests for serializing SpanGroup

a7e6311

Fix serialization of SpanGroup

dea31c2

Add EdgeC and GraphC structs

21536b5

Add draft Graph data structure

1e437b5

Compile graph

70d97ee

honnibal changed the title ~~WIP: Allow Doc object to track named span groups~~ WIP: Add SpanGroup and Graph container types to represent arbitrary annotations Jan 11, 2021

More work on Graph

5b2d4b3

svlandeg added the feat / coref (Feature: Coreference resolution) label Jan 11, 2021

svlandeg reviewed Jan 11, 2021

honnibal added 8 commits January 12, 2021 02:09

Update GraphC

06d853b

Upd graph

9ec4658

Fix walk functions

b9610a8

Let Graph take nodes and edges on construction

dbdd3be

Fix walking and getting

55c495d

Add graph tests

91cb0e6

Fix import

cb7a94d

Add module with the SpanGroups dict thingy

a6f83a8

Try to fix c++11 compilation

436a6e8

honnibal and others added 16 commits January 12, 2021 16:07

Fix test

145bbf6

Update DocBin

5705adf

Try to fix compilation

d36a5e9

Try to fix graph

cbe9414

Improve SpanGroup docstrings

99b2ea3

Add doc.spans to documentation

33ac558

Fix serialization

a73e104

Tidy up and add docs

92fc535

Update docs [ci skip]

f7eafbb

Add SpanGroup.has_overlap

4eb9bfe

WIP updated Graph API

87e41d0

Start testing new Graph API

51a48d3

Update Graph tests

fb32dec

Update Graph

4e02ebb

Add docstring

ea97568

Merge branch 'feature/spans-on-doc' of https://github.com/explosion/s…

dc5a950

…paCy into feature/spans-on-doc

honnibal changed the title ~~WIP: Add SpanGroup and Graph container types to represent arbitrary annotations~~ Add SpanGroup and Graph container types to represent arbitrary annotations Jan 14, 2021

honnibal merged commit f277bfd into develop Jan 14, 2021

ines deleted the feature/spans-on-doc branch January 30, 2021 08:54

This was referenced Mar 1, 2021

Native coref component #7243

Merged

Native coref component #7264

Closed

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 11, 2021

Make all Span attrs writable

a3a7c8d

Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 11, 2021

Make all Span attrs writable

49aeaae

Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.

adrianeboyd mentioned this pull request May 11, 2021

Make all Span attrs writable #8062

Merged

3 tasks

honnibal pushed a commit that referenced this pull request May 17, 2021

Make all Span attrs writable (#8062)

82fa81d

Also allow `Span` string properties `label_` and `kb_id_` to be writable following #6696.

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 19, 2021

Make all Span attrs writable (explosion#8062)

0a50363

Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.

svlandeg pushed a commit to svlandeg/spaCy that referenced this pull request May 26, 2021

Make all Span attrs writable (explosion#8062)

0227fbd

Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 31, 2021

Make all Span attrs writable (explosion#8062)

4404250

Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
