ENH: Adding multi-threading to algorithm eval node. #343
Conversation
@jmilloy It didn't seem to make sense to do a mixin -- the structure between this and the compositor was different enough. I do wonder if we should have a common/global ThreadPool, because right now this structure allows threads to create additional threads. In fact, it probably SHOULD have a common thread pool. Do you agree?
Actually, I take it back about the common thread pool. I can see cases where the common thread pool would cause the whole thing to hang: if 10 instances of A each generate 10 threads for B, that takes up the entire thread pool. Do you agree, @jmilloy?
Okay @jmilloy, this should be ready to merge.
You bring up a serious question. If I set n_threads to 10, then I should expect there to be no more than 10 threads total, not 10 threads per Algorithm node. There needs to be a way to separate the execution into separate threads up to a point, and then force subsequent nodes to stay within their thread.

Secondly, are there caching issues here to be wary of? For example, if two nodes A1 and A2 require the results from the same input node B, when they evaluate serially, A1 will cause B to evaluate, and A2 will just use the cached results from B. If they execute in parallel, B will start evaluation twice, and one will finish first and cache its result, and the second will try to cache the same result. We need to add a test to make sure that this still works. Either the second one should ignore the exception that is raised by there now being a cached result, or, even better, the first one should mark that a cached result will be available so that the second one can wait for it.
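A minimal sketch of the "wait for the pending result" idea, using a threading.Event per in-flight key; eval_with_cache and compute are hypothetical stand-ins, not podpac's cache API:

```python
import threading

_cache = {}
_pending = {}      # key -> Event, set once the result lands in _cache
_lock = threading.Lock()

def eval_with_cache(key, compute):
    """Evaluate compute() once per key, even under concurrent callers."""
    with _lock:
        if key in _cache:
            return _cache[key]
        event = _pending.get(key)
        if event is None:
            # First caller: mark the result as pending, then compute it.
            event = _pending[key] = threading.Event()
            first = True
        else:
            first = False
    if first:
        result = compute()
        _cache[key] = result   # cache before waking the waiters
        event.set()
        return result
    event.wait()               # later callers block until it's cached
    return _cache[key]
```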
I'm not sure.
I just spoke with @mls. I think the correct behavior is to create threads until the common thread pool is full. If the thread pool is full, then evaluate serially in the current thread. This should be okay even if you only have 10 threads but need 16 threads at a particular level: at that level, 6 threads will be waiting. One level lower, when a thread tries to create more threads, it will notice the thread pool is full and proceed serially.

Caching... didn't think about that one. It will be tricky to test, I'm guessing... well, maybe not. I'll work on it. So, two things I'll work on:

* Falling back to serial evaluation when the common thread pool is full
* A test for the caching race condition
I'm not sure if this applies to a thread pool as well. Should methods of a threadpool only ever be used by the thread that created it? Unfortunately, the threadpool is not really documented anywhere.
This seems backwards. Just wanted to check.
Yeah. I don't know how hard it will be to keep track of how many threads are available. Even if 9 out of the 10 threads are used and you need 100 threads, you can still use the thread pool. But I think you will have race conditions if you are not careful. You probably need to lock the thread pool while you are checking whether it is full and starting workers, so that the first worker doesn't start claiming more workers itself.
Maybe you can get away with something like this (by incrementing the in-use count before starting workers). Even if sibling nodes check whether the pool is full at the same time and both start claiming workers, they won't interfere with each other. I don't know. Have fun...
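A minimal sketch of that idea, assuming a hypothetical module-level counter rather than podpac's actual thread_manager API:

```python
import threading
from multiprocessing.pool import ThreadPool

MAX_THREADS = 10
_in_use = 0
_lock = threading.Lock()

def claim(n):
    """Atomically claim up to n worker slots; return how many were granted."""
    global _in_use
    with _lock:
        # The in-use count is bumped inside the lock, *before* any worker
        # starts, so sibling nodes can't claim the same slots twice.
        granted = min(n, MAX_THREADS - _in_use)
        _in_use += granted
        return granted

def release(n):
    global _in_use
    with _lock:
        _in_use -= n

def eval_inputs(tasks):
    """tasks: zero-argument callables standing in for input-node evals."""
    n = claim(len(tasks))
    if n == 0:
        # Pool is full: evaluate serially in the current thread.
        return [task() for task in tasks]
    pool = ThreadPool(processes=n)
    try:
        return pool.map(lambda task: task(), tasks)
    finally:
        pool.close()
        release(n)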
Could even free up the pool a bit faster like this. Okay, I'll stop now.
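A guess at the faster-release variant: hand each claimed slot back as soon as its own task finishes, instead of holding all of them until the whole batch is done. The shared budget here is a BoundedSemaphore; all names are illustrative, not podpac's API:

```python
import threading
from multiprocessing.pool import ThreadPool

budget = threading.BoundedSemaphore(10)   # process-wide thread budget

def eval_inputs(tasks):
    """tasks: zero-argument callables standing in for input-node evals."""
    claimed, serial = [], []
    for task in tasks:
        # Non-blocking claim: if no slot is free, this task runs serially.
        (claimed if budget.acquire(blocking=False) else serial).append(task)

    def run(task):
        try:
            return task()
        finally:
            budget.release()   # free this slot the moment the task ends

    pool = ThreadPool(processes=max(len(claimed), 1))
    try:
        async_results = [pool.apply_async(run, (t,)) for t in claimed]
        serial_results = [task() for task in serial]
        # Note: results come back grouped (threaded first), not in input order.
        return [r.get() for r in async_results] + serial_results
    finally:
        pool.close()
        pool.join()
```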
So, based on that, it seems like the better practice is to create new threadpools in each thread? That doesn't seem right. I understand the reason for it in the process-pool case, but it doesn't seem correct in the thread case. What's the general feeling about threads creating threads that create threads, etc.?
I think it's just that they don't recommend trying to interact with the pool object from child processes, one reason being the deadlock scenario that you brought up. Of course, creating a new pool in the child process and using that would be fine. I would think it is the same for a thread pool. I think it's fine for threads to create threads that create threads, and it's fine to create threadpools in threads in the same way. It's just probably not wise to try to re-use a common threadpool across tiers of threads, because of the race conditions and deadlock. Maybe our scenario is simple enough that we can manage it and avoid those issues. Perhaps a safer version of this, but still flexible, is to create the threadpool ad-hoc but track globally how many threads you have made.
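As a sanity check on the threadpools-in-threads point, a tiny self-contained demo with plain multiprocessing.pool.ThreadPool (no podpac): a worker thread can create and use its own short-lived pool:

```python
from multiprocessing.pool import ThreadPool

def inner(x):
    return x * x

def outer(xs):
    # Each top-level worker makes its own short-lived pool.
    pool = ThreadPool(processes=2)
    try:
        return sum(pool.map(inner, xs))
    finally:
        pool.close()

top = ThreadPool(processes=2)
try:
    print(top.map(outer, [[1, 2, 3], [4, 5, 6]]))   # [14, 77]
finally:
    top.close()
```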
I guess you could always allow one worker (which is the same as executing serially...)
I thought about that; it makes the code a little cleaner, but I don't feel great about daisy-chaining threads unnecessarily.
That was fun, seems like it works:

```python
%%timeit
with podpac.settings:
    podpac.settings['DEFAULT_CACHE'] = []
    podpac.settings['RAM_CACHE_ENABLED'] = False
    podpac.settings['CACHE_OUTPUT_DEFAULT'] = False
    podpac.settings['MULTITHREADING'] = True
    podpac.settings['N_THREADS'] = 32
    alg.eval(coords)
```

2.46 s ± 73.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```python
%%timeit
with podpac.settings:
    podpac.settings['DEFAULT_CACHE'] = []
    podpac.settings['RAM_CACHE_ENABLED'] = False
    podpac.settings['CACHE_OUTPUT_DEFAULT'] = False
    podpac.settings['MULTITHREADING'] = True
    podpac.settings['N_THREADS'] = 4
    alg.eval(coords)
```

4.36 s ± 313 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```python
%%timeit
with podpac.settings:
    podpac.settings['DEFAULT_CACHE'] = []
    podpac.settings['RAM_CACHE_ENABLED'] = False
    podpac.settings['CACHE_OUTPUT_DEFAULT'] = False
    podpac.settings['MULTITHREADING'] = False
    alg.eval(coords)
```

5.49 s ± 345 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that when I give it fewer threads, it takes longer again!
TODO: Add test for the race condition RE: caching.
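A hedged sketch of what such a test might look like, with toy stand-ins for the shared node and its cache rather than podpac's actual classes:

```python
import threading

def test_parallel_eval_caches_shared_input_once():
    evaluations = []   # records each real evaluation of the shared node B
    cache = {}
    lock = threading.Lock()

    def eval_b():
        """Stand-in for node B: evaluate once, then serve the cached value."""
        with lock:
            if "B" not in cache:
                evaluations.append(1)
                cache["B"] = 42
        return cache["B"]

    # A1, A2, ... all need B at the same time.
    threads = [threading.Thread(target=eval_b) for _ in range(32)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(evaluations) == 1   # B was evaluated exactly once
    assert cache["B"] == 42        # and the cached value is intact
```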
…sure there should be no issues with caching outputs of Nodes. I think this is overkill given the GIL, but should now be pretty safe.
Added a race condition test. This is ready to go. Let me know if/when you're happy @jmilloy so we can merge.
I don't see any tests for the thread manager. I guess it would be okay not to have explicit thread manager tests, because I think we should add an algorithm test that falls back to serial evaluation when no more threads are available (or did I miss that your test actually does that?).
I can make the changes that I've proposed here if you don't have a chance to. Let me know if you want me to (and which changes you actually agree with and think are worth it).
* Made the ThreadPool creation part of the thread_manager
* Doing serial computation if N_THREADS == 1 (had to release the obtained thread)
* Added the _multi_threaded node attribute to help with testing/debugging multi-threaded execution
* Using the Lock context manager instead of acquire/release (see the example below)
* Added a test to stress the number of threads in the execution, checking that the correct number of threads causes a cascade to lower levels in the pipeline
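For reference, the Lock context-manager change mentioned above is the standard pattern:

```python
import threading

lock = threading.Lock()

def do_work():
    pass  # placeholder for the critical section

# Before: explicit acquire/release needs a try/finally, or the lock
# leaks if the critical section raises.
lock.acquire()
try:
    do_work()
finally:
    lock.release()

# After: the context manager releases the lock even on exceptions.
with lock:
    do_work()
```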
Okay @jmilloy, have another peek. Should now be good to go.
Great. I like the _multi_threaded debug flag, too.
This should allow I/O parallelism.
Initial testing looks good. Running https://github.com/creare-com/podpac-drought-monitor/blob/develop/notebooks/Drought-Monitor-Pipeline.ipynb with two additional cells suggests a 2x increase in speed.
This is to close #312