-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SpanGroup and Graph container types to represent arbitrary annotations #6696
Conversation
My two cents:
Agreed with the trade-off here, I think
If we do call the dictionary
I would keep the dictionary structure, I quite like it as such, and the API is more straightforward, as you argued.
I wondered about this redundancy too. It's a little bit like the pipeline component names, that store their own name but then the
Alternatively, this function could be called
instead of
Hmm yes, we really should avoid the tuple format. But then how to do this? We can't make a |
After some discussion, we're currently at the following answers:
Nah.
Make a very thin
Kill it.
Ugh, maybe? Not currently implemented though. |
…paCy into feature/spans-on-doc
Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
Also allow `Span` string properties `label_` and `kb_id_` to be writable following #6696.
Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
Also allow `Span` string properties `label_` and `kb_id_` to be writable following explosion#6696.
Proposal
The
Doc
object has specific "slots" for the core annotations, which are heavily constrained for both efficiency and API simplicity. The only way to store arbitrary annotations has been to place data indoc.user_dict
and then access it via the extension attribute system.This PR provides native support for two additional container types, for more flexible type of information storage.
SpanGroup
: A sequence of labelled spans.Graph
: A sequence of labelled, directed relations between sets of tokens. The nodes of the graph (the tokens) don't have to form contiguous spans. Nodes can also be empty, allowing arbitrary labels to be attached to the token groups.Two new attributes are added to the
Doc
that components can use to store their annotations:doc.spans
(`Dict[str, SpanGroup])doc.graphs
(Dict[str, Graph])Pipeline components could then be configured with a string under which to store their annotations. For instance, we expect to add a built-in coreference coreference component. With its default configuration, it would write to
doc.spans["coref"]
. An alternative coreference component could be configured to write to the same key, or a different one if you want to store annotations from both.SpanGroup
The new
SpanGroup
is a named list ofSpan
objects. Arbitrary json-serializable attributes can also be attached to theSpanGroup
. TheDoc
object is given a new dict attribute,doc.span_groups
, whose keys are strings and whose values areSpanGroup
objects.Example use-cases
Example usage
Implementation details
The main trick here is avoiding reference cycles. The
Span
object holds a reference to theDoc
, so we don't want theDoc
object to hold references to actualSpan
objects. Otherwise, the reference counting won't be able to free theDoc
(its count will never drop to zero), and we'll have to rely on the garbage collection. Relying on garbage collection is bad: it means the memory accumulates, it introduces pauses, and it makes destructors very difficult to reason about (because you don't know when the destructor will be called). It's especially problematic for managing GPU memory, because the garbage collection is triggered by memory pressure, which doesn't consider pressure is on GPU resources.To avoid the reference cycles, the
SpanGroup
object owns a weakref to theDoc
, which doesn't increase the reference count, and stores the span data using avector[SpanC]
. This required a small refactor to theSpan
object to make it use theSpanC
object to hold its internal data.An alternative to the weakref would be to require the
Doc
to be passed in explicitly when fetching data back out of theSpanGroup
. This would stop us from having the span group work like a list; we couldn't havespan = span_group[i]
. We would need to have something likespan = span_group.get(doc, i)
, which isn't very nice imo.Decisions to debate
doc.spans
ordoc.span_groups
?Just from the name, I'd guess
doc.spans
would be a list, not a dict. On the other hand that will only surprise you once, and after that you can stop writingdoc.span_groups
, which feels inconsistent with the preference for brevity elsewhere in the API.I suppose we could make
doc.spans
a property that iterates out all the spans in thedoc.span_groups
. Seems somewhat pointless though?We could make it just a list, but I think it's nice to fetch the span group by name.
SpanGroups
data structure manage the subgroups?We could have some more complex container that has an API for giving out all of the span groups and then named subgroups of them as well. So it would be like
doc.spans.group("coref")
or something. I think this will send people to the docs all the time to remember the API specifics? A dict is only a little bit less concise, and it feels so much simpler.There are some arguments for making an object to manage the whole dict though. One is that the whole dict needs to be serialized together; we currently use some dict comprehensions for this. Another is that we could have one weakref owned by the whole container, rather than giving each span group its own reference. We also have a little bit of duplication in the current data structure: the names are stored twice, once as keys and again as an attribute of the
SpanGroup
. This information can get out-of-sync, because we have it twice.doc.create_span_group
is uglyYou can't create a
SpanGroup
without passing in theDoc
object, so I added this helper to add a new span group. It feels weird though.span_groups
field in theDoc.__init__
, right?It's a bit annoying because we'll have to pass in the span data in dict format. What we musn't do is give in to the temptation to do the evil (deprecated) tuple format for a span, like
(label, start, end)
.Graph
This is much more drafty, but I've done some initial work on it and I think it seems promising. Directed labelled graphs can represent pretty much anything (although you might need to transform it before you can query it in any practical way). spaCy's philosophy has previously been that directed graphs are too under-constrained, which doesn't let us build good APIs for linguistic annotations. Most linguistic annotations aren't arbitrary, they have specific structure, so we can build much tighter interfaces for them. For instance, the tree constraint works really well for about 98% of syntax, and letting a word have multiple heads makes the structure really difficult to use.
Still, sometimes you do want a graph, e.g. for predicate-argument structures. So we should have this type in the library. This is especially true because the Python ecosystem doesn't actually have many options for this. The
networkx
package is the most popular, but it's a pretty big library and I have doubts that a pure-Python implementation is well suited to our use-case. graph-tool looks much more appealing to me, but it also has a pretty wide scope, and I doubt there are plans to package it for pip (it seems really hard). I think we should have our own graph type and let people export to these other packages for stuff like visualisation, instead of having them as a dependency.TODO
Doc
byte serializationDocBin
byte serializationFuture work
SpanGroup
for beam results?SpanGroup
inMatcher
?Graph
and refine its API