💫 Proposal: A component-based processing pipeline architecture via Doc._, Token._ and Span._ #1381
Comments
This is incredible and will solve a lot of problems. A lot of great thinking and design decisions. A couple of reactions and questions formed by thinking about how I'd apply this to some use cases. As it stands, one can only add to the pipe – e.g., there isn't a way to remove or replace a single component. In general, I have to say the current state of this pipeline business is confusing. To me, the lifecycle of a pipeline should be as simple as possible.
Does that mean user_hooks and its ilk would (hopefully) be jettisoned? I really like `._` for the fact that it lives at the same level as spaCy's variables. As you well know, it flies in the face of PEP 8 though, where a leading underscore signals an internal name – a bit of mental dissonance and a cause for confusion for those of us who abide by that rule. Also, it still allows for collisions amongst extensions. Why not just use a transparent `.my_extension` namespace, similar to Python's name mangling with `__var`, to prevent namespace collisions?
Thanks for your feedback!
The tokenizer is a "special" pipeline component in the sense that it takes a different input – text. That's also the reason it's not part of the regular pipeline, and can simply be overwritten:

```python
nlp.tokenizer = MyCustomTokenizer(nlp.vocab)
```

We do think that there's a point in keeping those things simple. If something is an object you can overwrite, you should be able to do so. Just like you'll still be able to append stuff to the pipeline directly.
Good point – didn't add this in my proposal, but those methods should definitely exist. Similarly, there should probably be a method for removing a component from the pipeline as well.
I hope we'll be able to make this less confusing in the new documentation! I think a lot of the design around this comes down to how the models work under the hood, and how we've been moving towards making them more transparent (since there are now many different models with different features and trade-offs instead of just one "the model").

A model = weights (binary data) + pipeline + language data. The pipeline is applied when you call `nlp` on a text. When you load a model, roughly the following happens:

```python
cls = util.get_lang_class(lang)
nlp = cls(pipeline=pipeline)
nlp.from_disk(model_data_path)
```

There's actually very little going on at the `spacy.load()` level beyond that.
I'm personally not a huge fan of the user_hooks, so ideally, this new system would make most of them unnecessary going forward.
Interesting! I remember talking about this with @honnibal and how other libraries are doing similar things with:

```python
class Doc(object):
    pass

class Underscore(object):
    pass

Doc._ = Underscore()
Doc._.foo = 'bar'
```

Or is there something I'm missing?
You mean, just setting the attribute on the class directly like in your example? Namespace collisions are a valid concern though – but also a problem that many other applications with a third-party extension ecosystem have solved before us. So I'm pretty confident we can find a good solution for this. For developers looking to publish spaCy extensions, the recommended best practices should also include an option to allow the user to overwrite the attributes the extension is setting (which seems pretty standard for plugins in general). This way, if two extensions happen to clash (or if the user disagrees with the developer's naming preferences 😉), it's easy to fix.
I think your model = weights (binary data) + pipeline + language data is a great starting point. Then recursively going into each part and explaining how it's made and manipulated would be great. As you say, it is mostly a documentation and naming issue. Maybe there should be an advanced usage ("what is happening under the hood") section for each component, for those wanting to delve deeper, train models, create pipelines, etc.? Having too many options to do the same thing can be confusing.
Do you know which ones? I'd be curious to have a look at them.
I wouldn't go so far as saying that `_` is an invalid variable name. It's just that `_` is a special character in Python, and you are co-opting it for another use – defining a public namespace. I guess I don't see how `_` is better than, say, a transparent name like `.my_extension`. Looking forward to seeing this in action!
Related issues: #1085, #1105, #860
CC: @honnibal, @Liebeck, @christian-storm
Motivation
Custom processing pipelines are a very powerful feature of spaCy that will be able to solve many problems people are currently having when making NLP work for their specific use case. So for spaCy v2.0, we've been working on improving the processing pipelines architecture and extensibility. Fundamentally, a pipeline is a list of functions called on a `Doc` in order. The pipeline can be set by a model, and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a `Doc` and returns it. However, even with the current state of proposed improvements, the pipelines still aren't perfect and as user-friendly as they should be.

If it's easier to write custom data to the `Doc`, `Token` and `Span`, applications using spaCy will be able to take full advantage of the built-in data structures and the benefits of `Doc` objects as the single source of truth containing all information. Instead of mixing `Doc` objects, arrays, plain text and other structures, applications could pass around only `Doc` objects and read from and write to them whenever necessary.

Having a straightforward API for custom extensions and a clearly defined input/output (`Doc` in, `Doc` out) also helps make larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.

The spaCy philosophy has always been to focus on providing one best-possible implementation, instead of adopting a "broad church" approach, which makes a lot of sense for research libraries, but can be potentially dangerous for libraries aimed at production use. Going forward, I believe the best future-proof strategy is to direct our efforts at making the processing pipeline more transparent and extensible, and encouraging a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components could range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow.
There are many components users may want, and we'd love to be able to offer more built-in pipeline components shipped with spaCy (e.g. SBD, SRL, coref, sentiment). But there's also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models (the other big issue we're tackling with v2.0).
TL;DR

- A `Doc._`, `Token._` and `Span._` attribute users can write to, choosing any custom namespace.
- An `Underscore` class will wire it all together and resolve the custom properties for tokens and spans, which are only views of the `Doc`.
- A `Language.add_pipe` method to add pipeline components, with options to specify the pipeline IDs to add the component before/after, and a `Language.replace_pipeline` method to replace the entire pipeline.
- A `Language.pipe_names` property that returns a list of the pipeline IDs (e.g. `['tensorizer', 'ner']`) as a human-readable version of `Language.pipeline`.
- A `Pipe` base class used by spaCy for its built-in components like the tagger, parser and entity recognizer.

Why `._`?

Letting the user write to a `._` attribute instead of to the `Doc` directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you've implemented your own `.coref` property and spaCy claims it one day, it'll break your code. Similarly, as we have more and more production users with sizable code bases, this solution will make it much easier to tell what's built-in and what's custom. Just by looking at the code, you'll immediately know that `doc.sentiment` is spaCy, and `doc._.sent_score` isn't.

`Doc._` is shorter and more distinct than `Doc.user_data`, and for the lack of better options in Python, the `_` seems like the best choice. (It's also kinda cute... `doc._.doc`... once you see the face, you can't unsee it. Just like I'll never be able to read `doc.cats` as "doc dot categories" again 😺)

Custom pipeline components
Pipeline components can write to a `Doc`, `Span` or `Token`'s `_` attribute, which is resolved internally via an `Underscore` class. In the case of `Span` and `Token`, this means resolving it relative to the respective indices, as they are only views of a `Doc`. A pipeline component can hold any state, take the shared `Vocab` if needed, and implement its own getters and setters.

A component added to the pipeline needs to be a callable that takes a `Doc`, modifies it and returns it. Here's a simple example of a component wrapper that takes arbitrary settings and assigns "something" to a `Doc` and `Token`:
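A sketch of what this could look like – the component name, settings and attribute names below are all invented for illustration, assuming the proposed `._` attributes are writable:

```python
class EmojiFlagger(object):
    """Illustrative component: flags Docs and Tokens that contain emoji."""
    name = 'emoji_flagger'  # ID the pipeline would refer to this component by

    def __init__(self, emoji=(u'😀', u'😂', u'💫')):
        self.emoji = set(emoji)  # arbitrary settings live on the component

    def __call__(self, doc):
        # Write to the proposed `._` attributes on the Doc and its Tokens.
        for token in doc:
            token._.is_emoji = token.text in self.emoji
        doc._.has_emoji = any(token._.is_emoji for token in doc)
        return doc  # a component must return the Doc it was given
```

The custom component could then be initialised and used like this (the model name and `after='ner'` positioning are only examples):

```python
import spacy

nlp = spacy.load('en')
flagger = EmojiFlagger(emoji=(u'😀', u'😂'))
nlp.add_pipe(flagger, after='ner')  # the proposed add_pipe method
```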
`add_pipe()` would offer a more convenient way of adding to the pipeline than `pipeline.append()` or overwriting the pipeline, which easily gets messy, as you have to know the names and order of components, or at least the index at which to insert the new component. The `before` and `after` keyword arguments can specify one or more IDs to insert the component before/after (which will be resolved accordingly, and raise an error if the positioning is impossible).

When the pipeline is applied, the custom attribute is available via `._`:
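With the sketched component from above in the pipeline, that would look something like this (values are hypothetical):

```python
doc = nlp(u"This proposal is great 💫")
print(doc._.has_emoji)      # True
print(doc[-1]._.is_emoji)   # True – resolved relative to the token's index
```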
This system would also allow adding custom `Doc`, `Token` and `Span` methods, similar to the built-in `similarity()`.
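A rough sketch of how a method-valued attribute might behave – the names are invented and the exact mechanism isn't specified in this proposal:

```python
# Hypothetical: an extension could attach a callable, similar to similarity()
doc._.emoji_ratio = lambda: sum(t._.is_emoji for t in doc) / len(doc)
print(doc._.emoji_ratio())  # e.g. 0.2
```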
A model can either require the component package as a dependency, or ship the component code as part of the model package. It can then be added to the pipeline in the model's `__init__.py`:
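For example, roughly like this – assuming a helper along the lines of spaCy's `load_model_from_init_py`, with the component name carried over from the sketch above; the module layout is hypothetical:

```python
# model_name/__init__.py — hypothetical model package entry point
from spacy.util import load_model_from_init_py
from .components import EmojiFlagger  # shipped with, or required by, the model

def load(**overrides):
    nlp = load_model_from_init_py(__file__, **overrides)
    nlp.add_pipe(EmojiFlagger(), after='ner')  # proposed API
    return nlp
```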
Alternatively, a trainable and fully serializable custom pipeline component could also be implemented via the `Pipe` base class, which is used for spaCy's built-in pipeline components like the tagger, parser and entity recognizer in v2.0.

Going forward, we can even take this architecture one step further and allow other applications to register spaCy pipeline components via entry points, which would make them available via their name.
New classes, methods and properties

`Language.pipe_names` (property)
Returns a list of pipeline component IDs in order. Useful to check the current pipeline, and determine where to insert custom components.

`Language.add_pipe` (method)
Add a component to the pipeline.

| Argument | Description |
| --- | --- |
| `component` | The pipeline component to add – a callable that takes a `Doc`, modifies it and returns it. |
| `name` | ID of the component. Defaults to `component.name`. |
| `before` | One or more component IDs to insert the new component before. |
| `after` | One or more component IDs to insert the new component after. |

`Language.replace_pipeline` (method)
Replace the pipeline.

| Argument | Description |
| --- | --- |
| `pipeline` | The new pipeline – a list of components. |
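Taken together, usage might look roughly like this – behaviour assumed from the descriptions above, with `my_component` and `other_component` standing in for any callable components:

```python
print(nlp.pipe_names)
# e.g. ['tensorizer', 'tagger', 'parser', 'ner']

nlp.add_pipe(my_component, name='my_component', before='ner')
nlp.replace_pipeline([my_component, other_component])
```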
`Underscore` (class)
Resolves `Doc._`, `Span._` and `Token._` set by the user.
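One way such a class could work internally – storing all custom data on the `Doc` and keying it by attribute name plus the view's indices. This is only a sketch of the idea, not the proposed implementation:

```python
class Underscore(object):
    def __init__(self, doc, start=None, end=None):
        # Bypass our own __setattr__ for the internal bookkeeping fields.
        object.__setattr__(self, '_doc', doc)
        object.__setattr__(self, '_start', start)  # None for the Doc itself
        object.__setattr__(self, '_end', end)

    def __getattr__(self, name):
        # Custom data lives on the Doc, keyed by name and the view's indices,
        # so Token and Span views resolve against the same underlying store.
        return self._doc.user_data[(name, self._start, self._end)]

    def __setattr__(self, name, value):
        self._doc.user_data[(name, self._start, self._end)] = value
```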
The pipeline component ecosystem

The processing pipeline outlined in this proposal is a good fit for a component-based ecosystem, as pipeline components would have the following features: a lifecycle, an isolated scope and a standardised API.
Component-based ecosystems can be very powerful in driving forward community contributions, while at the same time, keeping the core library focussed and compact. We're obviously happy to integrate third-party components into the core if they're a good fit, but we also want developers to be able to take ownership of their extensions, write spaCy wrappers for their libraries and implement any logic they need quickly, without having to worry about the grand scheme of things.
If you're the maintainer of a library and want to integrate it with spaCy, you'd be able to offer a simple pipeline component your users could plug in and use. Your installation instructions would be as simple as: install the package, initialise it with your settings and add it to your pipeline using `nlp.add_pipe()`. Your extension can claim its own `._` namespace on the `Doc`, `Token` and `Span`.

Production users with large code bases would be able to manage their spaCy extensions and utilities as packages that can be developed and integrated into CI workflows independently.
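As a concrete version of that installation story – the library, setting and attribute names below are all invented for illustration:

```python
from your_library import YourComponent  # pip install your-library

nlp.add_pipe(YourComponent(setting=True))
doc = nlp(u"Some text")
print(doc._.your_attribute)  # set in the extension's own `._` namespace
```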
Aside from the obvious use case of implementing models and missing text processing features, there are many other, creative ways in which pipeline component extensions can be utilised.
In terms of the community strategy around this: I don't think a `spacy-contrib` or `spacy-extensions` package like some other libraries have would be a good solution for us. Extensions are very specific, and users shouldn't have to install a bunch of stuff they don't need just to use one particular component. Versioning packages like this is also a nightmare. Similarly, I don't think we should make people submit them to an "official" repository – if someone made a spaCy extension, they should be able to showcase their work on their own GitHub profiles.