Store activations in Docs when save_activations is enabled (explosion#11002)

* Store activations in Doc when `store_activations` is enabled

  This change adds the new `activations` attribute to `Doc`. This attribute can be used by trainable pipes to store their activations, probabilities, and guesses for downstream users.

  As an example, this change modifies the `tagger` and `senter` pipes to add a `store_activations` option. When this option is enabled, the probabilities and guesses are stored in `set_annotations`.

* Change type of `store_activations` to `Union[bool, List[str]]`

  When the value is:
  - A bool: all activations are stored when set to `True`.
  - A List[str]: the activations named in the list are stored.

* Formatting fixes in Tagger

* Support store_activations in spancat and morphologizer

* Make Doc.activations type visible to MyPy

* textcat/textcat_multilabel: add store_activations option

* trainable_lemmatizer/entity_linker: add store_activations option

* parser/ner: do not currently support returning activations

* Extend tagger and senter tests

  So that they, like the other tests, also check that we get no activations if no activations were requested.

* Document `Doc.activations` and `store_activations` in the relevant pipes

* Start errors/warnings at higher numbers to avoid merge conflicts

  Between the master and v4 branches.

* Add `store_activations` to docstrings.

* Replace store_activations setter by set_store_activations method

  Setters that take a different type than what the getter returns are still problematic for MyPy. Replace the setter by a method, so that type inference works everywhere.

* Use dict comprehension suggested by @svlandeg

* Revert "Use dict comprehension suggested by @svlandeg"

  This reverts commit 6e7b958.

* EntityLinker: add type annotations to _add_activations

* _store_activations: make kwarg-only, remove doc_scores_lens arg

* set_annotations: add type annotations

* Apply suggestions from code review

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TextCat.predict: return dict

* Make the `TrainablePipe.store_activations` property a bool

  This means that we can also bring back the `store_activations` setter.

* Remove `TrainablePipe.activations`

  We do not need to enumerate the activations anymore since `store_activations` is `bool`.

* Add type annotations for activations in predict/set_annotations

* Rename `TrainablePipe.store_activations` to `save_activations`

* Error E1400 is not used anymore

  This error was used when activations were still `Union[bool, List[str]]`.

* Change wording in API docs after store -> save change

* docs: tag (save_)activations as new in spaCy 4.0

* Fix copied line in morphologizer activations test

* Don't train in any test_save_activations test

* Rename activations

  - "probs" -> "probabilities"
  - "guesses" -> "label_ids", except in the edit tree lemmatizer, where "guesses" -> "tree_ids".

* Remove unused W400 warning.

  This warning was used when we still allowed the user to specify which activations to save.

* Formatting fixes

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Replace "kb_ids" by a constant

* spancat: replace a cast by an assertion

* Fix EOF spacing

* Fix comments in test_save_activations tests

* Do not set RNG seed in activation saving tests

* Revert "spancat: replace a cast by an assertion"

  This reverts commit 0bd5730.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
jordankanter · Mar 14, 2024 · bc10678
1 parent 6e6c5a7
Showing 28 changed files with 669 additions and 355 deletions.
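The commit message describes a per-Doc mapping keyed first by pipe name and then by activation name ("probabilities", plus "label_ids" or, for the edit tree lemmatizer, "tree_ids"). A rough sketch of that layout follows, using plain Python lists instead of arrays. The pipe names used as keys, the numeric values, and the helper `top_label_ids` are illustrative assumptions, not part of the commit.

```python
from typing import Dict, List

# Hypothetical stand-in for the mapping kept on a Doc:
# pipe name -> activation name -> values for that Doc.
DocActivations = Dict[str, Dict[str, List]]


def top_label_ids(probabilities: List[List[float]]) -> List[int]:
    """Return the index of the highest-scoring label for each token row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up values for a two-token Doc; real pipes would store arrays here.
activations: DocActivations = {
    "tagger": {
        "probabilities": [[0.1, 0.9], [0.7, 0.3]],
        "label_ids": [1, 0],
    },
    "trainable_lemmatizer": {
        "probabilities": [[0.2, 0.8], [0.6, 0.4]],
        "tree_ids": [1, 0],
    },
}

# A downstream consumer can look up one pipe's activations by name.
tagger_acts = activations["tagger"]
print(top_label_ids(tagger_acts["probabilities"]))
```

In this toy data the stored "label_ids" happen to agree with the argmax of "probabilities"; whether that holds in practice depends on the pipe.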
diff --git a/spacy/pipeline/edit_tree_lemmatizer.py b/spacy/pipeline/edit_tree_lemmatizer.py
@@ -4,8 +4,8 @@
 
 import numpy as np
 import srsly
-from thinc.api import Config, Model, NumpyOps, SequenceCategoricalCrossentropy
-from thinc.types import Floats2d, Ints2d
+from thinc.api import Config, Model, SequenceCategoricalCrossentropy
+from thinc.types import ArrayXd, Floats2d, Ints1d
 
 from .. import util
 from ..errors import Errors
@@ -22,6 +22,9 @@
 TOP_K_GUARDRAIL = 20
 
 
+ActivationsT = Dict[str, Union[List[Floats2d], List[Ints1d]]]
+
+
 default_model_config = """
 [model]
 @architectures = "spacy.Tagger.v2"
@@ -50,6 +53,7 @@
         "overwrite": False,
         "top_k": 1,
         "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+        "save_activations": False,
     },
     default_score_weights={"lemma_acc": 1.0},
 )
@@ -62,6 +66,7 @@ def make_edit_tree_lemmatizer(
     overwrite: bool,
     top_k: int,
     scorer: Optional[Callable],
+    save_activations: bool,
 ):
     """Construct an EditTreeLemmatizer component."""
     return EditTreeLemmatizer(
@@ -73,6 +78,7 @@ def make_edit_tree_lemmatizer(
         overwrite=overwrite,
         top_k=top_k,
         scorer=scorer,
+        save_activations=save_activations,
     )
 
 
@@ -92,6 +98,7 @@ def __init__(
         overwrite: bool = False,
         top_k: int = 1,
         scorer: Optional[Callable] = lemmatizer_score,
+        save_activations: bool = False,
     ):
         """
         Construct an edit tree lemmatizer.
@@ -103,6 +110,7 @@ def __init__(
             frequency in the training data.
         overwrite (bool): overwrite existing lemma annotations.
         top_k (int): try to apply at most the k most probable edit trees.
+        save_activations (bool): save model activations in Doc when annotating.
         """
         self.vocab = vocab
         self.model = model
@@ -117,7 +125,7 @@ def __init__(
 
         self.cfg: Dict[str, Any] = {"labels": []}
         self.scorer = scorer
-        self.numpy_ops = NumpyOps()
+        self.save_activations = save_activations
 
     def get_loss(
         self, examples: Iterable[Example], scores: List[Floats2d]
@@ -146,31 +154,24 @@ def get_loss(
 
         return float(loss), d_scores
 
-    def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
-        if self.top_k == 1:
-            scores2guesses = self._scores2guesses_top_k_equals_1
-        elif self.top_k <= TOP_K_GUARDRAIL:
-            scores2guesses = self._scores2guesses_top_k_greater_1
-        else:
-            scores2guesses = self._scores2guesses_top_k_guardrail
-        # The behaviour of *_scores2guesses_top_k_greater_1()* is efficient for values
-        # of *top_k>1* that are likely to be useful when the edit tree lemmatizer is used
-        # for its principal purpose of lemmatizing tokens. However, the code could also
-        # be used for other purposes, and with very large values of *top_k* the method
-        # becomes inefficient. In such cases, *_scores2guesses_top_k_guardrail()* is used
-        # instead.
+    def predict(self, docs: Iterable[Doc]) -> ActivationsT:
         n_docs = len(list(docs))
         if not any(len(doc) for doc in docs):
             # Handle cases where there are no tokens in any docs.
             n_labels = len(self.cfg["labels"])
-            guesses: List[Ints2d] = [self.model.ops.alloc2i(0, n_labels) for _ in docs]
+            guesses: List[Ints1d] = [
+                self.model.ops.alloc((0,), dtype="i") for doc in docs
+            ]
+            scores: List[Floats2d] = [
+                self.model.ops.alloc((0, n_labels), dtype="f") for doc in docs
+            ]
             assert len(guesses) == n_docs
-            return guesses
+            return {"probabilities": scores, "tree_ids": guesses}
         scores = self.model.predict(docs)
         assert len(scores) == n_docs
         guesses = scores2guesses(docs, scores)
         assert len(guesses) == n_docs
-        return guesses
+        return {"probabilities": scores, "tree_ids": guesses}
 
     def _scores2guesses_top_k_equals_1(self, docs, scores):
         guesses = []
@@ -230,8 +231,13 @@ def _scores2guesses_top_k_guardrail(self, docs, scores):
 
         return guesses
 
-    def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
+    def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT):
+        batch_tree_ids = activations["tree_ids"]
         for i, doc in enumerate(docs):
+            if self.save_activations:
+                doc.activations[self.name] = {}
+                for act_name, acts in activations.items():
+                    doc.activations[self.name][act_name] = acts[i]
             doc_tree_ids = batch_tree_ids[i]
             if hasattr(doc_tree_ids, "get"):
                 doc_tree_ids = doc_tree_ids.get()
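The hunks above change `predict` to return a dict of batch-level activations and change `set_annotations` to copy the i-th entry of each activation into `doc.activations[self.name]` when `save_activations` is set. A minimal, self-contained sketch of that pattern is below. `FakeDoc` and `FakePipe` are hypothetical stand-ins written for illustration, not spaCy classes, and the predicted values are made up.

```python
from typing import Any, Dict, List


class FakeDoc:
    """Hypothetical stand-in for a Doc; holds only what this sketch needs."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.activations: Dict[str, Dict[str, Any]] = {}


class FakePipe:
    """Hypothetical pipe mirroring the predict/set_annotations split in the diff."""

    def __init__(self, name: str, save_activations: bool = False) -> None:
        self.name = name
        self.save_activations = save_activations

    def predict(self, docs: List[FakeDoc]) -> Dict[str, List[Any]]:
        # Each activation holds one entry per doc; the values are placeholders.
        probabilities = [[[0.25, 0.75] for _ in doc.words] for doc in docs]
        tree_ids = [[1 for _ in doc.words] for doc in docs]
        return {"probabilities": probabilities, "tree_ids": tree_ids}

    def set_annotations(
        self, docs: List[FakeDoc], activations: Dict[str, List[Any]]
    ) -> None:
        for i, doc in enumerate(docs):
            if self.save_activations:
                # Copy doc i's slice of every batch-level activation.
                doc.activations[self.name] = {}
                for act_name, acts in activations.items():
                    doc.activations[self.name][act_name] = acts[i]


docs = [FakeDoc(["a", "b"]), FakeDoc(["c"])]
pipe = FakePipe("trainable_lemmatizer", save_activations=True)
pipe.set_annotations(docs, pipe.predict(docs))

# With save_activations left at its default, nothing is stored on the docs.
quiet_docs = [FakeDoc(["d"])]
quiet_pipe = FakePipe("trainable_lemmatizer")
quiet_pipe.set_annotations(quiet_docs, quiet_pipe.predict(quiet_docs))
```

Keying the stored dict by `self.name` keeps activations from different pipes separate on the same doc, and the default of `False` means nothing is saved unless it is requested.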