-
Hi all - I posted this on StackOverflow, and I'm repeating it here. I'm using the AllenNLP (version 2.6) semantic role labeling model to process a large pile of sentences. My Python version is 3.7.9, and I'm on macOS 11.6.1. My goal is to speed up processing with a multiprocessing pool. In the parent process, I have explicitly placed the model in shared memory.
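A minimal sketch of this kind of setup, with a plain dict standing in for the AllenNLP model so the pattern runs anywhere. All names here (`process_batch`, `_init_worker`, the stand-in model) are illustrative, not the actual code; the real version would load the SRL `Predictor` and call `share_memory()` on the underlying torch model before creating the pool:

```python
import multiprocessing as mp

_model = None  # set in each worker by the pool initializer


def _init_worker(model):
    # Hand each child a reference to the (shared) model. With fork, the
    # model's memory pages are inherited rather than copied.
    global _model
    _model = model


def _predict(sentence):
    # Stand-in for _model.predict(sentence=...); the real output is SRL JSON.
    return {"model": _model["name"], "words": sentence.split()}


def process_batch(sentences, processes=4):
    # Stand-in for the real model; the actual code would do something like
    # predictor._model.share_memory() here before creating the pool.
    model = {"name": "srl-model"}
    ctx = mp.get_context("fork")  # fork so children inherit the parent's pages
    with ctx.Pool(processes=processes, initializer=_init_worker,
                  initargs=(model,)) as pool:
        return pool.map(_predict, sentences)
```

Only the input sentence and the result dict cross the process boundary per call; the model itself never gets pickled.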
I know the model is only being loaded once, in the parent process, and I place it in shared memory whether or not I'm going to make use of the pool. I'm calling the predictor from the pool workers. Here's what happens for a batch of 395 sentences:
The more processes, the worse the total AllenNLP processing time - even though the model is explicitly in shared memory, and the only things that cross the process boundary during the invocation are the input text and the output JSON. I've done some profiling, and the first thing that leaps out at me is a single function dominating the time. Any pointers or suggestions would be appreciated.
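For reference, profiling a prediction loop like this can be done with the standard library's `cProfile`/`pstats`; `predict` below is a stand-in for the real predictor call, and the top entries sorted by cumulative time are where a hot function leaps out:

```python
import cProfile
import io
import pstats


def predict(sentence):
    # Stand-in for the AllenNLP predictor call being profiled.
    return {"words": sentence.split()}


def profile_batch(sentences):
    # Profile just the prediction loop, then return the top ten entries
    # sorted by cumulative time as a plain string.
    pr = cProfile.Profile()
    pr.enable()
    for s in sentences:
        predict(s)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()
```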
-
That's very likely and would be my first guess as well. I'm not totally sure what
-
Well, my next thought was to try to defeat the memory sharing, and I redefined
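One way to defeat the sharing, sketched below with illustrative names (a plain dict stands in for the model): deep-copy the model inside each child's pool initializer, so every worker owns a private copy rather than a reference to the parent's shared object.

```python
import copy
import multiprocessing as mp

_model = None  # each worker's private copy, set by the initializer


def _init_private_copy(model):
    # Deep-copy inside the child so every worker owns private memory,
    # defeating copy-on-write sharing of the parent's model object.
    global _model
    _model = copy.deepcopy(model)


def _predict_private(sentence):
    # Stand-in for the per-worker predictor call.
    return {"model": _model["name"], "words": sentence.split()}


def run_with_private_copies(sentences, processes=2):
    model = {"name": "srl-model"}  # stand-in for the real model
    ctx = mp.get_context("fork")
    with ctx.Pool(processes, initializer=_init_private_copy,
                  initargs=(model,)) as pool:
        return pool.map(_predict_private, sentences)
```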
-
I have tried this; in fact, I've tried several iterations, since each time I suspected that I had somehow failed to isolate the child process elements. I've found no way to reduce the multiprocessing times below the numbers in the original question. The time scales with the number of processes, and I'm still nowhere near the memory limit of my machine. Nothing should be swapped and nothing should be shared, yet I can't get single-process speed in the child processes no matter what I try.
-
And, to answer the obvious question: when I run two non-parallel processes in two different shells, both run at single-process speed. I'm wondering whether the torch shared-memory mechanism is somehow recognizing the model I'm loading and putting it in shared memory because multiple children are using the same model, regardless of whether it's the same model instance.
-
I'll see if I can pare something down - the code is complicated and not publicly available.
I'm off this week - back next week.
Pete wrote:
… Do you have a snippet of code you could share? If I had something to run I'd be interested in debugging this.
-
It would be weird to mark the previous thread as the answer, since it's buried deep in the replies, but the answer appears to be that once I time things carefully enough, the mystery goes away - probably because I wasn't taxing my machine enough, in the right ways, in the baseline case. Thanks to @epwalsh for his patience.
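Careful timing here means measuring only the prediction calls themselves, so the baseline and pooled runs are compared on the same footing. A minimal sketch with `time.perf_counter` (`predict` is an illustrative stand-in for the real predictor call):

```python
import time


def predict(sentence):
    # Stand-in for the real predictor call being timed.
    return {"words": sentence.split()}


def timed_batch(sentences):
    # Time only the prediction loop, excluding pool startup, pickling,
    # and result collection, which would otherwise skew the comparison.
    start = time.perf_counter()
    results = [predict(s) for s in sentences]
    elapsed = time.perf_counter() - start
    return results, elapsed
```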