Running the tokeniser in parallel does not gain a lot from more cores? #10306
So I have been trying to tokenize a large quantity of text, but I don't seem to gain a lot from the `n_process` argument. The setup is that I read in a newline-delimited JSON file and tokenize all the texts (I have recalculated the number of tokens as I read the file in multiple times to simulate more text). The number of tokens comes from the input length column, i.e. it is not calculated from the tokenization itself, to avoid having that as a bottleneck.
I instead ran it without converting the output to a list (convert to list = 0).
The outline of the script is this:

```python
import ndjson
import spacy
from tqdm import tqdm

print("Create data stream")


def file_gen(f_paths):
    """Read texts from a list of newline-delimited JSON file paths."""
    pbar = tqdm(f_paths)
    for f_path in pbar:
        with open(f_path) as f:
            reader = ndjson.reader(f)
            for row in reader:
                yield row["BodyText"]


def text_chunk(text: str, chunk_size: int):
    """Chunk a text into pieces of at most chunk_size characters."""
    if chunk_size:
        start_c = 0
        end_c = start_c + chunk_size
        t = text[start_c:end_c]
        while t:
            yield t
            start_c = end_c
            end_c += chunk_size
            t = text[start_c:end_c]
    else:
        yield text


def token_stream(f_paths, chunk_size=None):
    """Mostly this is here to get the it/s from tqdm."""
    articles = file_gen(f_paths)
    for a in tqdm(articles):
        for t in text_chunk(a, chunk_size):
            yield t


nlp = spacy.blank("da")
f_paths = [path_to_file] * n_repeats  # path_to_file and n_repeats are defined elsewhere
docs = nlp.pipe(token_stream(f_paths, chunk_size=2400), n_process=64, batch_size=1024)
list(docs)
```

I removed the code for recording the time taken and for looping over e.g. chunk size. So here I clearly gain some speed by using 64 cores vs. 16 cores, but nowhere near what I would have expected, and it does not seem to be explained by factors such as inconsistent text lengths. Given that I do get some improvement, it also does not seem to be a problem with the input. Is this scaling expected, or should I expect better scaling?
Replies: 1 comment 1 reply
Multiprocessing introduces a lot of overhead, and it's likely that the overhead outweighs the gains with 16->64 processes.

In general, I've found that tokenizing with `nlp.pipe(n_process>1)` is slow, and I haven't done detailed profiling (famous last words), but I strongly suspect it's due to the `Doc` serialization that's happening under the hood. (Peter has been profiling the serialization some and this may get at least a little bit faster in the next release: #10250)

But just for tokenization, the doc serialization is a lot of overhead. I've found that it's much faster to use `nlp.pipe(n_process=1)` with `multiprocessing.Pool` and just return the space-separated text rather than a `Doc` (or retu…
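
A minimal sketch of that approach, assuming a blank Danish pipeline and an in-memory list of texts (the batching helper, worker setup, and sample texts are my own illustration, not from the reply): each worker builds its own tokenizer once, runs `nlp.pipe(n_process=1)` on a batch, and returns plain space-separated strings, so no `Doc` objects are pickled between processes.

```python
# Sketch: parallel tokenization with multiprocessing.Pool, returning strings
# instead of Doc objects (assumed setup, adapt paths/pipeline as needed).
import multiprocessing

import spacy

_nlp = None


def _init_worker():
    """Build the tokenizer once per worker process."""
    global _nlp
    _nlp = spacy.blank("da")


def _tokenize_batch(texts):
    """Tokenize a batch of texts and return space-separated strings,
    so only plain strings cross the process boundary."""
    return [" ".join(tok.text for tok in doc) for doc in _nlp.pipe(texts)]


def _batched(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    texts = ["Dette er en test.", "Endnu en tekst om noget andet."] * 10_000
    with multiprocessing.Pool(processes=16, initializer=_init_worker) as pool:
        results = pool.map(_tokenize_batch, _batched(texts, 1024))
    tokenized = [t for batch in results for t in batch]
    print(tokenized[0])
```

The point of this pattern is that only strings are serialized between the parent and the workers, which sidesteps the `Doc` serialization cost that `nlp.pipe(n_process=64)` incurs for every document.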