
[WIP] Support adding pipeline component instances #12710

Open · wants to merge 16 commits into base: v4

Conversation

honnibal (Member)

Add a method Language.add_pipe_instance that allows you to add component instances to the pipeline. This is primarily intended to make it easier to assemble a pipeline dynamically in a script, which is useful for exploration or for situations where you don't care so much about serialisation.

Components that are added this way cannot be reconstructed automatically during deserialisation. Instead, a pipe_instances argument is added to spacy.load().

Example usage 1: creating an NLP object with a function, and using nlp.to_disk() and nlp.from_disk().

If users wish to assemble the pipeline themselves in a function, they don't need to worry about the config system. They just have to make sure they use the same function to construct the pipeline each time.

import spacy
from spacy.pipeline import Tagger
from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL, tagger_score

def make_nlp():
    nlp = spacy.blank("en")
    tagger = Tagger(nlp.vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
    nlp.add_pipe_instance(tagger)
    return nlp

nlp = make_nlp()
nlp.to_disk("/tmp/my_nlp")
# Later
nlp = make_nlp()
nlp.from_disk("/tmp/my_nlp")

Example usage 2: Passing component instances to spacy.load()

If pipe instances are added directly, we won't be able to reconstruct them in spacy.load(). Instead of just making the pipeline impossible to load this way, you can pass the missing instances using the pipe_instances argument. This approach is kind of impractical for components that need the Vocab instance, but I definitely think it's better than just not supporting it.

vocab = Vocab()
tagger = Tagger(vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
nlp = spacy.load("/tmp/my_nlp", vocab=vocab, pipe_instances={"tagger": tagger})

I added some logic to fetch the vocab instance from a pipe instance, so that nlp = spacy.load("/tmp/my_nlp", pipe_instances={"tagger": tagger}) would work. The value of that logic is a bit dubious.
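For reference, the shape of that logic is roughly the following. This is a minimal standalone sketch with hypothetical names (the PR implements it as a private Language._get_instantiated_vocab method):

import warnings
from spacy.vocab import Vocab

def get_instantiated_vocab(pipe_instances):
    # Reuse a Vocab found on the passed-in components, warning when
    # two components carry different instances.
    vocab = None
    for name, instance in pipe_instances.items():
        candidate = getattr(instance, "vocab", None)
        if not isinstance(candidate, Vocab):
            continue
        if vocab is not None and candidate is not vocab:
            warnings.warn(f"Pipe instance '{name}' has a different Vocab instance.")
        vocab = candidate
    return vocab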

Questions for discussion

  • We could overload add_pipe, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API (see the sketch after this list).
  • We could support pipe metadata in add_pipe_instance, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.
  • Naming of pipe_instance argument?
  • Fishing the vocab instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the vocab argument be set?
  • Wording of the warnings about vocab instances?
  • Should we check for vocab instances more robustly? We could go through the __dict__, but that still doesn't cover nested objects. I felt that the vocab member was a decent thing to check, albeit not watertight.
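To make the overloading idea in the first bullet concrete, here is a rough sketch of how the dispatch could look, written as a hypothetical free function (not part of this PR):

from typing import Any, Callable, Union
from spacy.language import Language
from spacy.tokens import Doc

def add_pipe_any(nlp: Language, component: Union[str, Callable[[Doc], Doc]], **kwargs: Any):
    # Strings keep going through the factory machinery as they do today;
    # anything callable is treated as a ready-made instance and goes
    # through this PR's add_pipe_instance.
    if isinstance(component, str):
        return nlp.add_pipe(component, **kwargs)
    return nlp.add_pipe_instance(component, **kwargs)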

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

svlandeg added the labels enhancement (Feature requests and improvements) and feat / pipeline (Feature: Processing pipeline and components) on Jun 10, 2023
svlandeg (Member) left a comment

I definitely see what you're aiming to do here, and I know that the config system is occasionally a bit of a hurdle for users, so it makes sense to think of alternative ways that don't rely on the config so much.

The "Example usage 1" to me is crucial to understand how this functionality would/should look like in terms of UX. The code that you cited in the PR description won't actually work, because DEFAULT_TAGGER_MODEL is a Config object (aka a dict), not an actual Thinc Model.

If I'm not mistaken, you'd have to do something like this instead:

from spacy.ml.models import build_hash_embed_cnn_tok2vec, build_tagger_model

tok2vec = build_hash_embed_cnn_tok2vec(pretrained_vectors=None, width=96, depth=4, embed_size=2000, window_size=1, maxout_pieces=3, subword_features=True)
model = build_tagger_model(tok2vec=tok2vec)
tagger = Tagger(nlp.vocab, model=model, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)

Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now the docs focus on "spacy.HashEmbedCNN.v2" and don't even mention build_hash_embed_cnn_tok2vec).

Alternatively, we'd have something like this:

RESOLVED_TAGGER_MODEL = registry.resolve(Config().from_str(default_model_config))["model"]

which would be an actual Thinc model, and something we can provide as part of the core library. But it doesn't really address the core issue to me, which is that if you want to go in and change settings of this tagger model, you'd still end up having to edit the default_model_config, so you'd really still be working with the config system.

We'd probably always run into this kind of config dependency with the default settings, because we decided for v3 to record all of the default values in these default model config strings (representing dicts), and we explicitly didn't want to repeat them in the functions themselves (like in build_hash_embed_cnn_tok2vec), to avoid the constant overwriting of defaults we had before... 🤔

svlandeg (Member)

They just have to make sure they use the same function to construct the pipeline each time.

To continue a bit on the usage 1 example, this could become problematic in between spaCy versions, no? If we'd change the default tagger model, this make_nlp function would bypass the versioning and could break or produce different results.

honnibal (Member, Author)

The "Example usage 1" to me is crucial to understand how this functionality would/should look like in terms of UX.

I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now the docs focus on "spacy.HashEmbedCNN.v2" and don't even mention build_hash_embed_cnn_tok2vec).

These are good points to keep in mind. I would suggest getting the core functionality in first, and then at least there's some way to do this, even if the support for the overall workflow in the docs and the library is shaky.

We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.
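As a rough sketch of what such a command-line lookup could build on (no such CLI command exists today; the registry call and architecture name below are existing spaCy APIs):

import inspect
import spacy

# Resolve a registered architecture name to its underlying function
# and report where that function is defined in the source tree.
func = spacy.registry.architectures.get("spacy.HashEmbedCNN.v2")
print(f"{func.__module__}: {inspect.getsourcefile(func)}")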

To continue a bit on the usage 1 example, this could become problematic in between spaCy versions, no? If we'd change the default tagger model, this make_nlp function would bypass the versioning and could break or produce different results.

Well, let's say people go and reference undocumented stuff with this workflow. Yes, they can have their things break from this between versions. That's a risk that they should take in a well-informed way.

Trying to prevent every mistake or shield users from potential problems can be at the expense of flexibility. If we ask what we expect the Python API to be able to do, from a simple flexibility standpoint, to me it definitely should be able to add components by instance.

svlandeg (Member)

I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

So which use-cases are we really aiming to support here? The matcher doesn't seem applicable either - at least the built-in Matcher can't actually be used as a pipeline component, as it returns a list of matches on __call__, not the annotated docs like other components do.

If we're thinking about stateless functions like the evil_component test case added here, it feels like a small effort to add a one-liner to allow for nlp.add_pipe("evil_instance"):

@Language.component("evil_instance")
def evil_component(doc):
    ...

I can imagine other use-cases where you want to instantiate the component class directly from within code though, like for spacy-llm's (v0.3.0) LLM component:

from spacy_llm.pipeline import LLMWrapper
from spacy_llm.tasks import NERTask
from spacy_llm.backends.rest.openai import OpenAIBackend
from spacy_llm.cache import BatchCache

cache = BatchCache(path=None, batch_size=342, max_batches_in_mem=42)
backend = OpenAIBackend(config={"model": "text-davinci-003"}, strict=False, max_tries=4, interval=1.0, max_request_time=40)
instance = LLMWrapper(vocab=nlp.vocab, task=NERTask, backend=backend, cache=cache, save_io=False)

nlp.add_pipe_instance(instance, name="my_llm")

Like with the tagger use-case I tried to work through in my first example, you end up specifying all values of the Python objects you need to use, because we're currently handling all the defaults through the registered functions & the config.

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.

Agreed - this would be good functionality to have!

@@ -675,6 +675,7 @@ def load_model_from_init_py(
     enable: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
     exclude: Union[str, Iterable[str]] = _DEFAULT_EMPTY_PIPES,
     config: Union[Dict[str, Any], Config] = SimpleFrozenDict(),
+    pipe_instances: Dict[str, Any] = SimpleFrozenDict(),
svlandeg (Member) commented on Jun 26, 2023

Not having this in load_model_from_init_py broke nlp = spacy.load("en_core_web_sm"), which wasn't caught by the test suite :|

honnibal (Member, Author)

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

I think it's better than nothing, yeah.

I'm interested in whether there's a way we can somehow provide the defaults in Python without duplicating them in the configs. Perhaps we could inspect the function to build the default config? Anyway, we can leave that for a separate question. I'd still want add_pipe_instance even if we don't.
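If defaults were declared on the functions themselves, that inspection could look something like this (purely a sketch of the idea floated above, not an existing spaCy API, and not how current architecture functions declare their defaults):

import inspect

def defaults_from_signature(func):
    # Harvest the keyword defaults declared in a function's signature,
    # as a possible basis for generating a default config in Python.
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }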

If people build a suite of their own functions and things, they can manage their own defaults, and they should be able to express a reasonable Python API for assembling a pipeline without having to go through our config system.

svlandeg (Member) left a comment

We could overload add_pipe, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API.

I'm personally not a huge fan of overloading, for type & clarity reasons. For now, I think it's fine as a separate method.

We could support pipe metadata in add_pipe_instance, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.

This metadata typically contains the defaults from the factory, so it seems irrelevant in this case?

Naming of pipe_instance argument?

I can't think of a better one...

Fishing the vocab instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the vocab argument be set?

Should we check for vocab instances more robustly? We could go through the __dict__, but that still doesn't cover nested objects. I felt that the vocab member was a decent thing to check, albeit not watertight.

I think the _get_instantiated_vocab() method looks reasonable as-is and will provide useful warnings when there are conflicts.

self._pipe_meta[name] = self.get_factory_meta(factory_name)
pipe_index = self._get_pipe_index(before, after, first, last)
self._components.insert(pipe_index, (name, pipe_component))
svlandeg (Member) commented

We can revert the changes to add_pipe in this PR, if we're not overloading.

@@ -32,6 +32,7 @@ def load(
     enable: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
     exclude: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
+    pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
svlandeg (Member) commented

Suggested change:
-    pipe_instances: Dict[str, Any] = util.SimpleFrozenDict(),
+    pipe_instances: Dict[str, PipeCallable] = util.SimpleFrozenDict(),

cf. the type we're using in add_pipe_instance()

svlandeg (Member) commented

Same comment for the other methods, too.

where you provide a config and let spaCy construct the instance). See 'spacy.load'
for details of how to load back a pipeline with components added by instance.

pipe_instance (Callable[[Doc], Doc]): The component to add.
svlandeg (Member) commented

Suggested change:
-    pipe_instance (Callable[[Doc], Doc]): The component to add.
+    component (Callable[[Doc], Doc]): The component to add.

last (bool): If True, insert component last in the pipeline.
RETURNS (Callable[[Doc], Doc]): The pipeline component.

DOCS: https://spacy.io/api/language#add_pipe_instance
svlandeg (Member) commented

TODO: add documentation to API page

Comment on lines +867 to +869
name = name if name is not None else getattr(component, "name")
if name is None:
raise ValueError("TODO error")
svlandeg (Member) commented

Alternatively, we could require name to be a str in this method, and not Optional.

spacy/language.py (outdated review thread, resolved)
Comment on lines 1901 to 1916
if pipe_name not in pipeline:
opts = ", ".join(pipeline.keys())
raise ValueError(Errors.E956.format(name=pipe_name, opts=opts))
if pipe_name in pipe_instances:
if pipe_name in exclude:
continue
else:
nlp.add_pipe_instance(pipe_instances[pipe_name])
# Is it important that we instantiate pipes that
# aren't excluded? It seems like we would want
# the exclude check above. I've left it how it
# is though, in case there's some sort of crazy
# load-bearing side-effects someone is relying on?
pipe_cfg = util.copy_config(pipeline[pipe_name])
raw_config = Config(filled["components"][pipe_name])
if pipe_name not in exclude:
svlandeg (Member) commented

I agree that it looks like we should move the exclude check to be the first thing in the for loop - we specifically say that excluded components won't be loaded.

Let's move that to a separate PR though?
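For the record, the reordered loop for that follow-up might look roughly like this (a sketch paraphrasing the hunk quoted above, with a hypothetical wrapper function; the real code lives inline in Language.from_config):

from spacy.errors import Errors

def add_pipes(nlp, config, pipeline, pipe_instances, exclude):
    # Sketch: handle `exclude` first, so excluded components are never
    # validated against the config or instantiated at all.
    for pipe_name in config["nlp"]["pipeline"]:
        if pipe_name in exclude:
            continue
        if pipe_name not in pipeline:
            opts = ", ".join(pipeline.keys())
            raise ValueError(Errors.E956.format(name=pipe_name, opts=opts))
        if pipe_name in pipe_instances:
            nlp.add_pipe_instance(pipe_instances[pipe_name])
        else:
            ...  # construct the component from its config, as before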

spacy/errors.py (outdated)
@@ -219,6 +219,9 @@ class Warnings(metaclass=ErrorsWithCodes):
W125 = ("The StaticVectors key_attr is no longer used. To set a custom "
"key attribute for vectors, configure it through Vectors(attr=) or "
"'spacy init vectors --attr'")
W126 = ("Pipe instance '{name}' is being added with a vocab "
"instance that will not match other components. This is "
svlandeg (Member) commented

Suggested change:
-        "instance that will not match other components. This is "
+        "instance that does not match other components. This is "

svlandeg changed the base branch from master to main on January 29, 2024