How to prevent vocabulary from constantly growing? #9369
Replies: 3 comments 2 replies
-
The typical solution is to periodically reload the pipeline, as long as you don't have saved state (such as `Doc` objects) that still references the old vocab. The script from #5083 worked in spaCy v2 because there was a hard-coded limit for the size of the lexeme cache, which was removed in v3 since it's something that should be up to the user. In v2 the string store would still grow in the same way as in v3, but the lexeme cache wouldn't.

A similar v3 version could look something like this. Here the entire pipeline is reloaded to keep everything in sync, and the strings are reset to the original minimal set required by the freshly initialized components:

```python
import random
from string import ascii_letters, digits

import spacy


def generate_strings():
    # Yield an endless stream of random 20-character strings.
    while True:
        yield "".join(
            random.choice(" " * 10 + ascii_letters + digits) for _ in range(20)
        )


def main():
    nlp = spacy.blank("xx")
    # Snapshot the freshly initialized pipeline and its minimal string set.
    nlp_bytes = nlp.to_bytes()
    minimal_strings = set(nlp.vocab.strings) | set(
        nlp.vocab.strings[lex.orth] for lex in nlp.vocab
    )
    for i, doc in enumerate(nlp.pipe(generate_strings())):
        if not i % 10000:
            print(i, len(nlp.vocab), len(nlp.vocab.strings), doc.text)
        if len(nlp.vocab) > 10000:
            # Reload the whole pipeline and reset the string store.
            nlp.from_bytes(nlp_bytes)
            nlp.vocab.strings._reset_and_load(minimal_strings)


if __name__ == "__main__":
    main()
```

Depending on the pipeline components, you might be able to reduce what's serialized and reloaded a bit, in particular the vocab lookups and vectors, but they can also grow depending on what your pipeline does (retokenization adds vectors, setting certain lexeme properties adds lookups entries). Just reloading the vocab won't necessarily work because other components may have cache entries that reference strings or lexemes (here, it's the tokenizer cache). In general, the overall pipeline design is that a component can assume that something it has added to the string store is always there in the future.
-
Please understand that this script was just meant to update the previous example for v3, to demonstrate what needs to be reloaded in the general case. However, I just took another look at [...], so it would be better to use [...].

When you reload a model, you do need to be sure that you've previously called [...].

If you are using models with torch on GPU, you want to add this so that cupy and pytorch share the same memory pool:

```python
set_gpu_allocator("pytorch")
```

See the example here: https://spacy.io/usage/embeddings-transformers#transformers-runtime

And a related issue about GPU memory usage: #8984 (comment)
-
Thank you again. Great advice.
-
I apologize that this is essentially a repeat of what I have saved as:

Streaming Data Memory Growth (reprise) GitHub #5083

I have a 'prediction server' which receives a document, runs the NER or SpanCat pipeline, returns the predicted entities, and drops the document. And it keeps growing in memory.

More specifically, `nlp.vocab` keeps growing, as each request (document) contains new names, misspelled words, etc.
I tried various variations of the code suggested in #5083 (see below). But no matter what I do, the call

```python
nlp.vocab.strings._reset_and_load(minimal_strings)
```

does not reduce the vocabulary; it just keeps growing.
The code posted under #5083:
I am puzzled by the line:

```python
minimal_strings.update([nlp.vocab.strings[lex.orth] for lex in nlp.vocab])
```

Wouldn't that simply make `minimal_strings` equal to the current vocabulary (and hence prevent any reduction)?
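As a plain-Python illustration of that worry (toy sets standing in for the string store, not spaCy objects): if the snapshot is updated from the *grown* store, a subsequent reset removes nothing, whereas a snapshot taken once at startup stays minimal.

```python
# Toy model: the string store as a set, grown during processing.
startup_strings = {"IS_ALPHA", "LOWER"}        # hypothetical initial symbols
store = set(startup_strings)

store |= {"misspeling", "NewName", "xyzzy"}    # growth from incoming documents

# Case 1: snapshot taken at startup -> resetting shrinks the store.
minimal_strings = set(startup_strings)
store_after_reset = set(minimal_strings)       # what a reset aims for
print(len(store_after_reset))                  # 2, back to the minimal set

# Case 2: snapshot updated from the grown store -> reset removes nothing.
minimal_strings.update(store)
store_after_reset = set(minimal_strings)
print(len(store_after_reset))                  # 5, nothing was removed
```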
But even when I skip that update and keep calling

```python
nlp.vocab.strings._reset_and_load(minimal_strings)
```

with the original, unchanged `minimal_strings`, the vocabulary does not shrink. It keeps growing, almost as if `_reset_and_load()` only made a union of what was there before the call and `minimal_strings` (i.e. no reset).
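The behavior being described can be pinned down with a toy store (a hypothetical `ToyStore` class, not spaCy's StringStore): a true reset discards old entries first, while a union-style load keeps them, which is exactly what a store that "keeps growing" would look like.

```python
class ToyStore:
    """Toy store contrasting reset-and-load with union-style load."""

    def __init__(self, strings=()):
        self.strings = set(strings)

    def reset_and_load(self, strings):
        self.strings = set(strings)       # discard everything, then load

    def union_load(self, strings):
        self.strings |= set(strings)      # keep old entries (no reset)


minimal = {"a", "b"}
store = ToyStore(minimal)
store.strings.add("grown")

store.union_load(minimal)
print("grown" in store.strings)   # True: union behaves like "no reset"

store.reset_and_load(minimal)
print("grown" in store.strings)   # False: a real reset shrinks the store
```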
Any suggestion as to what I may be doing wrong, or better: is there a better way to keep my server from growing?