[WIP] Support adding pipeline component instances #12710
base: v4
Conversation
I definitely see what you're aiming to do here, and I know that the config system is occasionally a bit of a hurdle for users, so it makes sense to think of alternative ways that don't rely on the config so much.
The "Example usage 1" to me is crucial to understand how this functionality would/should look in terms of UX. The code that you cited in the PR description won't actually work, because `DEFAULT_TAGGER_MODEL` is a `Config` object (aka a `dict`), not an actual Thinc `Model`.
If I'm not mistaken, you'd have to do something like this instead:
```python
tok2vec = build_hash_embed_cnn_tok2vec(
    pretrained_vectors=None,
    width=96,
    depth=4,
    embed_size=2000,
    window_size=1,
    maxout_pieces=3,
    subword_features=True,
)
model = build_tagger_model(tok2vec=tok2vec)
tagger = Tagger(
    nlp.vocab,
    model=model,
    overwrite=False,
    scorer=tagger_score,
    neg_prefix="!",
    label_smoothing=0.0,
)
```
Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now the docs focus on `"spacy.HashEmbedCNN.v2"` and don't even mention `build_hash_embed_cnn_tok2vec`).
Alternatively, we'd have something like this:
```python
RESOLVED_TAGGER_MODEL = registry.resolve(Config().from_str(default_model_config))["model"]
```
which would be an actual Thinc model, and something we can provide as part of the core library. But it doesn't really address the core issue to me, which is that if you want to go in and change settings of this tagger model, you'd still end up having to edit the `default_model_config`, so you'd really still be working with the config system.
We'd probably always run into this kind of config dependency with the default settings, because we've decided for v3 to record all of the default values in these default model config strings (representing dicts), and we explicitly didn't want to repeat them in the functions themselves (like in `build_hash_embed_cnn_tok2vec`) to avoid the constant overwriting of defaults we had before... 🤔
To continue a bit on the usage 1 example, this could become problematic in between spaCy versions, no? If we'd change the default tagger model, this would break.
I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.
These are good points to keep in mind. I would suggest getting the core functionality in first, and then at least there's some way to do this, even if the support for the overall workflow in the docs and the library is shaky. We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.
Well, let's say people go and reference undocumented stuff with this workflow. Yes, they can have their things break from this between versions. That's a risk that they should take in a well-informed way. Trying to prevent every mistake or shield users from potential problems can be at the expense of flexibility. If we ask what we expect the Python API to be able to do, from a simple flexibility standpoint, to me it definitely should be able to add components by instance.
So which use-cases are we really aiming to support here? The matcher doesn't seem applicable either - at least not the built-in one. If we're thinking about stateless functions [...]

I can imagine other use-cases where you want to instantiate the component class directly from within code though, like for [...]

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.
Agreed - this would be good functionality to have!
```diff
@@ -675,6 +675,7 @@ def load_model_from_init_py(
     enable: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
     exclude: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
     config: Union[Dict[str, Any], Config] = SimpleFrozenDict(),
+    pipe_instances: Dict[str, Any] = SimpleFrozenDict(),
```
Not having this in `load_model_from_init_py` broke `nlp = spacy.load("en_core_web_sm")`, which wasn't caught by the test suite :|
I think it's better than nothing, yeah. I'm interested in whether there's a way we can somehow provide the defaults in Python without duplicating them in the configs. Perhaps we could inspect the function to build the default config? Anyway, we can leave that for a separate question. I'd still want this either way: if people build a suite of their own functions and things, they can manage their own defaults, and they should be able to express a reasonable Python API for assembling a pipeline without having to go through our config system.
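The "inspect the function to build the default config" idea could work roughly like this stdlib-only sketch; `build_tok2vec` here is a hypothetical stand-in, not a real spaCy function:

```python
import inspect
from typing import Any, Dict

def defaults_from_signature(func) -> Dict[str, Any]:
    """Build a config-style dict from a function's keyword defaults,
    so the defaults live only in the function signature."""
    sig = inspect.signature(func)
    return {
        name: param.default
        for name, param in sig.parameters.items()
        if param.default is not inspect.Parameter.empty
    }

def build_tok2vec(width: int = 96, depth: int = 4, embed_size: int = 2000):
    ...  # stand-in for a real layer constructor

print(defaults_from_signature(build_tok2vec))
# {'width': 96, 'depth': 4, 'embed_size': 2000}
```

This would avoid duplicating defaults between the config strings and the functions, though it couldn't express nested sub-configs (like a default tok2vec inside a tagger model) without extra conventions.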
> We could overload `add_pipe`, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API.
I'm personally not a huge fan of overloading, for type & clarity reasons. For now, I think it's fine as a separate method.
> We could support pipe metadata in `add_pipe_instance`, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.
This metadata typically contains the defaults from the factory, so it seems irrelevant in this case?
> Naming of the `pipe_instance` argument?
I can't think of a better one...
> Fishing the `vocab` instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the `vocab` argument be set?

> Should we check for `vocab` instances more robustly? We could go through the `__dict__`, but that still doesn't cover nested objects. I felt that the `vocab` member was a decent thing to check, albeit not watertight.
I think the `_get_instantiated_vocab()` method looks reasonable as-is and will provide useful warnings when there are conflicts.
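The vocab-fishing logic under discussion amounts to something like the following sketch. This is a toy reconstruction of the idea, not the PR's actual `_get_instantiated_vocab`; the fake pipe class and warning text are illustrative:

```python
import warnings
from typing import Any, Dict, Optional

def get_instantiated_vocab(current_vocab: Optional[Any],
                           pipe_instances: Dict[str, Any]) -> Optional[Any]:
    """Fish a vocab out of pipe instances, warning on conflicts.

    Only the top-level `vocab` attribute is checked, so nested objects
    holding their own vocab reference would slip through - the
    "not watertight" caveat from the discussion.
    """
    vocab = current_vocab
    for name, pipe in pipe_instances.items():
        pipe_vocab = getattr(pipe, "vocab", None)
        if pipe_vocab is None:
            continue
        if vocab is not None and pipe_vocab is not vocab:
            warnings.warn(
                f"Pipe instance '{name}' has a vocab instance that "
                "does not match other components."
            )
        vocab = pipe_vocab
    return vocab

class FakePipe:
    def __init__(self, vocab):
        self.vocab = vocab

v1, v2 = object(), object()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_instantiated_vocab(None, {"tagger": FakePipe(v1), "ner": FakePipe(v2)})
print(result is v2, len(caught))  # True 1
```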
```python
self._pipe_meta[name] = self.get_factory_meta(factory_name)
pipe_index = self._get_pipe_index(before, after, first, last)
self._components.insert(pipe_index, (name, pipe_component))
```
We can revert the changes to `add_pipe` in this PR, if we're not overloading.
```diff
@@ -32,6 +32,7 @@ def load(
     enable: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
     exclude: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
+    pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
```
```diff
-    pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
+    pipe_instances: Dict[str, PipeCallable] = util.SimpleFrozenDict(),
```
cf. the type we're using in `add_pipe_instance()`
Same comment for the other methods, too.
`spacy/language.py` (outdated)
```python
        where you provide a config and let spaCy construct the instance). See 'spacy.load'
        for details of how to load back a pipeline with components added by instance.

        pipe_instance (Callable[[Doc], Doc]): The component to add.
```
```diff
-        pipe_instance (Callable[[Doc], Doc]): The component to add.
+        component (Callable[[Doc], Doc]): The component to add.
```
```python
        last (bool): If True, insert component last in the pipeline.
        RETURNS (Callable[[Doc], Doc]): The pipeline component.

        DOCS: https://spacy.io/api/language#add_pipe_instance
```
TODO: add documentation to API page
```python
name = name if name is not None else getattr(component, "name")
if name is None:
    raise ValueError("TODO error")
```
Alternatively, we could require `name` to be a `str` in this method, and not `Optional`.
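One more wrinkle with the snippet above: `getattr(component, "name")` without a default raises `AttributeError` for components that have no `name` attribute at all (such as a plain lambda), so the `if name is None` branch can only trigger when `name` is explicitly set to `None`. A sketch of safer fallback logic, with a hypothetical error message:

```python
from typing import Callable, Optional

def resolve_pipe_name(component: Callable, name: Optional[str]) -> str:
    """Resolve the pipe name, falling back to the component's own name.

    getattr with a None default avoids AttributeError for components
    without a `name` attribute, so the ValueError path is reachable.
    """
    resolved = name if name is not None else getattr(component, "name", None)
    if resolved is None:
        raise ValueError(
            "Could not determine a name for the component: pass `name` "
            "explicitly or set a `name` attribute on the instance."
        )
    return resolved

class NamedPipe:
    name = "tagger"
    def __call__(self, doc):
        return doc

print(resolve_pipe_name(NamedPipe(), None))     # tagger
print(resolve_pipe_name(lambda doc: doc, "x"))  # x
```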
```python
if pipe_name not in pipeline:
    opts = ", ".join(pipeline.keys())
    raise ValueError(Errors.E956.format(name=pipe_name, opts=opts))
if pipe_name in pipe_instances:
    if pipe_name in exclude:
        continue
    else:
        nlp.add_pipe_instance(pipe_instances[pipe_name])
# Is it important that we instantiate pipes that
# aren't excluded? It seems like we would want
# the exclude check above. I've left it how it
# is though, in case there's some sort of crazy
# load-bearing side-effects someone is relying on?
pipe_cfg = util.copy_config(pipeline[pipe_name])
raw_config = Config(filled["components"][pipe_name])
if pipe_name not in exclude:
```
I agree that it looks like we should move the `exclude` check to be the first thing in the `for` loop - we specifically say that excluded components won't be loaded. Let's move that to a separate PR though?
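The suggested reordering (check `exclude` first, so excluded components are neither instantiated nor added) can be sketched as a toy loop; the names and factory callables below are illustrative stand-ins for the real config-driven loading:

```python
from typing import Callable, Dict, Iterable, List, Tuple

def assemble(pipeline_names: Iterable[str],
             pipe_instances: Dict[str, object],
             exclude: Iterable[str],
             factories: Dict[str, Callable]) -> List[Tuple[str, object]]:
    """Toy component loop with the exclude check moved to the top."""
    components = []
    for pipe_name in pipeline_names:
        if pipe_name in exclude:
            continue  # excluded components are never even instantiated
        if pipe_name in pipe_instances:
            components.append((pipe_name, pipe_instances[pipe_name]))
        else:
            components.append((pipe_name, factories[pipe_name]()))
    return components

made = []  # records which factories actually ran
factories = {
    "tok2vec": lambda: made.append("tok2vec") or "tok2vec_pipe",
    "tagger": lambda: made.append("tagger") or "tagger_pipe",
}
result = assemble(["tok2vec", "tagger", "ner"],
                  pipe_instances={"ner": "my_ner"},
                  exclude={"tagger"},
                  factories=factories)
print([n for n, _ in result], made)  # ['tok2vec', 'ner'] ['tok2vec']
```

With the check first, the excluded `tagger` factory never runs, matching the documented behaviour that excluded components won't be loaded.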
`spacy/errors.py` (outdated)
```diff
@@ -219,6 +219,9 @@ class Warnings(metaclass=ErrorsWithCodes):
     W125 = ("The StaticVectors key_attr is no longer used. To set a custom "
             "key attribute for vectors, configure it through Vectors(attr=) or "
             "'spacy init vectors --attr'")
+    W126 = ("Pipe instance '{name}' is being added with a vocab "
+            "instance that will not match other components. This is "
```
```diff
-            "instance that will not match other components. This is "
+            "instance that does not match other components. This is "
```
Add a method `Language.add_pipe_instance` that allows you to add component instances to the pipeline. This is primarily intended to make it easier to assemble a pipeline dynamically in a script, which is useful for exploration or for situations where you don't care so much about serialisation.

Components that are added this way cannot be reconstructed automatically during deserialisation. Instead, a `pipe_instances` argument is added to `spacy.load()`.
**Example usage 1:** creating an NLP object with a function, and using `nlp.to_disk()` and `nlp.from_disk()`. If users wish to assemble the pipeline themselves in a function, they don't need to worry about the config system. They just have to make sure they use the same function to construct the pipeline each time.
**Example usage 2:** passing component instances to `spacy.load()`. If pipe instances are added directly, we won't be able to reconstruct them in `spacy.load()`. Instead of just making the pipeline impossible to load this way, you can pass the missing instances using the `pipe_instances` argument. This approach is kind of impractical for components that need the `Vocab` instance, but I definitely think it's better than just not supporting it.

I added some logic to fetch the `vocab` instance from a pipe instance, so that `nlp = spacy.load("/tmp/my_nlp", pipe_instances={"tagger": my_tagger})` would work. The value of that logic is a bit dubious.

**Questions for discussion**
- We could overload `add_pipe`, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API.
- We could support pipe metadata in `add_pipe_instance`, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.
- Naming of the `pipe_instance` argument?
- Fishing the `vocab` instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the `vocab` argument be set?
- Should we check for `vocab` instances more robustly? We could go through the `__dict__`, but that still doesn't cover nested objects. I felt that the `vocab` member was a decent thing to check, albeit not watertight.

**Checklist**