
Add utilities for parallelization #8320

Merged
merged 4 commits on Jun 29, 2020
Conversation

McSinyx
Contributor

@McSinyx McSinyx commented May 25, 2020

This adds utils.parallel.map_{multiprocess,multithread}. It settles on a fallback mechanism for worker pools to resolve GH-8169. Additionally, I want to use this as the place to discuss the future use of this module. To avoid situations like GH-8161, it would be really nice if we could have parallelization as a toggle-able unstable feature, plus frequent prereleases to attract more feedback, especially from those using more obscure platforms. Edit: I forgot to run pre-commit before committing again.

cc @bmartinn on map_multiprocess

@McSinyx McSinyx marked this pull request as ready for review May 25, 2020 15:01
@McSinyx McSinyx force-pushed the pools branch 2 times, most recently from 9243eb8 to 498beac Compare May 25, 2020 15:11
@McSinyx
Contributor Author

McSinyx commented May 29, 2020

@uranusjr, I saw you giving this a thumbs-up while it was a draft. Since I've finished the tests, may I have a full review now?

@pradyunsg
Member

I do think that we should add an additional wrapper, to make using this a lot more transparent for the sequential case. We should also clean up the implementation, to remove the need for the try-except patterns that check whether things are usable on every run.

I'm basically imagining this file would do all the checks on import, and then we'd have a single entry point which can be used to dictate the processing: map_parallel(func, iterable). This would gracefully fall back to a regular map, while using whichever mechanism the user has requested.
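A minimal sketch of that single-entry-point idea (map_parallel and _HAVE_POOLS are hypothetical names for illustration, not pip's actual API; the flag would be computed once at import time, e.g. via the probes discussed further down):

from multiprocessing.dummy import Pool  # thread-based worker pool

_HAVE_POOLS = True  # placeholder for the real import-time check


def map_parallel(func, iterable, chunksize=1):
    """Apply func over iterable, degrading gracefully to the
    built-in map where worker pools are unusable."""
    if not _HAVE_POOLS:
        return list(map(func, iterable))
    with Pool() as pool:
        return pool.map(func, iterable, chunksize)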


In terms of how we do this, I think that we should just straight up assume threading support exists on the platform, and exit gracefully in every pip command if it doesn't. We shouldn't use multiprocessing, because it's very slow on some platforms, which defeats the purpose of the parallelization! (overhead > benefit)

In terms of rolling this out, a good start would be to add the logic for detecting this now, printing a warning that things might fail (when we detect that the support doesn't exist, via the deprecated helper), and switching to exiting gracefully starting in 20.3. That'll also give us the opportunity to see if we have users who care about non-threading Python -- letting them come to us and complain. If enough people complain, we'll reconsider what to do. If needed, we'll add code for doing things synchronously, by writing another function here that has the same side effects as the parallel utility, and fall back to that on such platforms.


In terms of the implementation, we need to care about all the caveats of parallelization here; that's one thing we can't escape.

I think we should not guarantee the order of the returned iterable. Instead, we should use a callback/error_callback based approach here (they're arguments on Pool methods), along with a .join() at the end of these helpers. This means that the callbacks would need to use queues for pushing information back to the main thread, which would denote progress, handle proper error messaging, etc. Anything needed to make implementing that easier (i.e. reusable bits!) would be what we'd put into this module, so that we'd have the ability to keep the final code as clean as possible.
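For illustration, a rough sketch of that callback-based shape, assuming a thread pool and queues for pushing results back (the function and variable names here are made up, not pip's):

from multiprocessing.dummy import Pool  # thread-based worker pool
from queue import Queue


def run_unordered(func, iterable):
    # Completion order is not guaranteed; results and errors are pushed
    # back to the main thread through queues by the pool's callbacks.
    results, errors = Queue(), Queue()
    pool = Pool()
    for item in iterable:
        pool.apply_async(func, (item,),
                         callback=results.put,       # called on success
                         error_callback=errors.put)  # called on failure (Python 3 only)
    pool.close()
    pool.join()  # wait for every submitted task, as suggested above
    if not errors.empty():
        raise errors.get()
    return [results.get() for _ in range(results.qsize())]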

Basically, these utilities would need some amount of effort put into them, in exchange for being closer to asyncio in terms of how-stuff-works (I know it's not exact, but it's closer), while not requiring *that* much implementation work and fitting in cleanly.


Also, timeout should probably not have a default. :P

@McSinyx
Contributor Author

McSinyx commented May 30, 2020

I do think that we should add an additional wrapper, to make using this a lot more transparent for the sequential case.

I don't think I get what you mean by the *sequential case*. Do you mean where higher-level code manually inserts jobs into the pool?

We should also clean up the implementation as well, to remove the need for the try-except patterns checking if things are usable on every run.

I'm basically imagining this file would do all the checks on-import, and then, we'd have a single entry point.

I've thought of that and I'm neither fully for nor against it:

  • It does make the implementation of the functions in this module cleaner; however, we'd need to do the if ...: def ... dance (see the sketch after this list), and while CPython actually does that a lot for wrapper code, I prefer having 2 blank lines between functions 🥺
  • Performance-wise: a check at import time makes pip start up (marginally) more slowly, while a runtime check adds (marginal) overhead (since exceptions are unlikely and are optimized as such, I think). I don't think the overhead in either case is an important factor though.
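The if ...: def ... shape referred to above, with bodies elided (purely illustrative; _HAVE_POOLS is a made-up flag that would be computed at import time):

_HAVE_POOLS = True  # placeholder for the real import-time check

if _HAVE_POOLS:
    def map_multithread(func, iterable, chunksize=1):
        ...  # pool-backed implementation
else:
    def map_multithread(func, iterable, chunksize=1):
        ...  # sequential fallback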

We shouldn't use multiprocessing because it's very slow on some platforms which defeats the purpose of the parallelization!

I think I forgot to give this PR enough context. This is not only for parallel networking, but also for CPU-intensive tasks like GH-8125. A large part of the hacks here came out of the discussion with @bmartinn over that PR.

In terms of rolling this out, a good start would be to add the logic for detecting this now, and printing a warning that things might fail [...] That'll also give us the opportunity to see if we have users who care about non-threading Python

I think you're referring to @pfmoore's comment over #8169 (comment), which doesn't fully capture the situation: the failure is due to the lack of sem_open on Android. I do not know whether threading actually works on Android though. Catching ImportError as in this PR won't fall back in case threading isn't supported, so we will know when there's user feedback. I agree with the warning too: I want to know where multiprocessing[.dummy].Pool is not usable, to gauge the community-wide impact of the speedup.

Also, timeout should probably not have a default. :P

It's an ugly hack to make KeyboardInterrupt work on Python 2 😞

I think we should not guarantee the order of the returned iterable. Instead, we should use a callback/error_callback based approach here

I originally wanted to provide both (1) the unordered and lazy variant and (2) the ordered and eager one, but due to the bug above I dropped the naïve attempt at the unordered one. I'll try to add the callback variant.

@pradyunsg
Member

this is not only for parallel networking, but also for CPU-intensive tasks like GH-8125

I presume you mean #8215? I don't think that task is CPU-intensive -- it's unpacking/moving files, which is definitely I/O bound.

I don't think subprocesses are the right approach for any of pip's "slow" parts, since all of those are I/O bound operations. :)

  • if ...: def ...

Rather, I meant:

def _parallel_whatever(...):
    assert _HAVE_THREAD_SUPPORT
    ...


def whatever(...):
    if _HAVE_THREAD_SUPPORT:
        return _parallel_whatever(...)
    # raise error, or call "non-parallel" fallback here.

I don't think I get what you mean by the *sequential case. Do you mean where higher-level code manually insert jobs to the pool?

I was referring to the case where we might want a sequential fallback for the parallel code here -- basically allowing for graceful degradation on platforms where we don't have threading support, if we want to do it that way. I don't know if we'd have to make this accommodation, and what amount of effort it'll take to "do it right"; but I do think if this might cause disruption, we should be a bunch more careful here.

a check at import time makes pip start up (marginally) more slowly, while a runtime check adds (marginal) overhead (since exceptions are unlikely and are optimized as such, I think).

We can optimize this later, by computing the specific value lazily. This feels like premature optimization and isn't a deal breaker IMO.

the failure is due to the lack of sem_open on Android.

I understand -- the effect is what matters though, that we can't use multiprocessing.dummy.Pool on that platform.

@bmartinn

I presume you mean #8215? I don't think that task is CPU-intensive -- it's unpacking/moving files, which is definitely I/O bound.

Actually this is CPU-bound: since the unzipping is Python code, threading actually hurts performance (due to the GIL). This was the reason to introduce a process pool (as opposed to the download part, which is accelerated by using threads, as it is mostly network-bound).

I'm open to other ideas on accelerating the wheel unzipping. If the use case is unzipping a single package then this is negligible, but when installing an entire environment, even if the wheels are cached, just unzipping 30 packages can take over 30 seconds.
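To make the GIL point concrete, here is a generic micro-benchmark sketch (hypothetical, not from this PR or pip): a pure-Python CPU-bound task usually sees little or no speedup from a thread pool but does from a process pool:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool


def work(n):
    # Pure-Python computation: it holds the GIL, so threads cannot run it in parallel.
    return sum(i * i for i in range(n))


if __name__ == '__main__':  # guard required by multiprocessing on Windows
    tasks = [10**6] * 16
    for name, pool_cls in (('threads', ThreadPool), ('processes', Pool)):
        start = time.time()
        with pool_cls(4) as pool:
            pool.map(work, tasks)
        print(name, round(time.time() - start, 2), 'seconds')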

@McSinyx
Contributor Author

McSinyx commented Jun 1, 2020

I presume you mean 8215? I don't think that task is CPU-intensive -- it's unpacking/moving files, which is definitely I/O bound.

@pradyunsg, yes 🤣 I haven't experimented with it much so I'll take @bmartinn's word for now. To @bmartinn, it would be nice if you could post the benchmark of multithreading vs multiprocessing over that PR. I'll try to catch up later by doing the same thing and we'll see if the results match.

Rather, I meant: [...]

I wonder what the benefit of doing so is, i.e. what the difference is between a conditional and exception handling. Also, I figured that we can import multiprocessing.synchronize instead of creating a Pool at the beginning of the module if we want to do it (see the sketch below).
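A sketch of the two import-time probes being compared here (illustrative only; the constant names are not necessarily what ends up in the module):

# (a) Create and discard a pool: heavier, but exercises the real code path.
try:
    from multiprocessing.dummy import Pool
    Pool(1).terminate()
    LACK_POOL = False
except (ImportError, NotImplementedError, OSError):
    LACK_POOL = True

# (b) Import multiprocessing.synchronize: cheap, and fails with ImportError
# on platforms lacking sem_open (the Android case from GH-8169).
try:
    import multiprocessing.synchronize  # noqa: F401
    LACK_SEM_OPEN = False
except ImportError:
    LACK_SEM_OPEN = True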

I don't get your point on the fallback part: this PR already falls back to map on known failures. Related to this, on error handling, I've done a local test and the error got propagated just fine (similar to when map is used). The problem was only with non-Exception exceptions like KeyboardInterrupt, and only on Python 2.

Edit: please ignore a1cdbc1; I thought of how to improve it right after pushing.

@pradyunsg pradyunsg changed the title Add utilities for paralleliztion Add utilities for parallelization Jun 2, 2020
@McSinyx McSinyx force-pushed the pools branch 2 times, most recently from a1cdbc1 to 1b4dac5 Compare June 4, 2020 15:22
@McSinyx
Contributor Author

McSinyx commented Jun 4, 2020

1b4dac5 makes every ugly hack I've ever written, even throw-away code I did for competitive programming, look pretty 😞

Quick interactive tests (mainly to save future me from rewriting them every time):

from __future__ import print_function
from pip._internal.utils.parallel import *

def ID(x): return x

# OK
map_multiprocess(print, range(10**7))

# Edit: this cannot be cancelled anyhow,
# but if print is replaced by time.sleep,
# KeyboardInterrupt works as expected.
map_multithread(print, range(10**7))

# Hangs forever; interrupt doesn't work properly
for i in imap_multiprocess(ID, range(10**7)): print(i)
for i in imap_multithread(ID, range(10**7)): print(i)

# OK, but chunksize needs to be inferred from the input size
for i in imap_multiprocess(ID, range(10**7), 10**6): print(i)
for i in imap_multithread(ID, range(10**7), 10**6): print(i)

At this point I'm not sure if the laziness is worth the amount of hacks we need to pull out. What do you think @pradyunsg?

@bmartinn

bmartinn commented Jun 4, 2020

Hi @McSinyx ,

Hang forever, interrupt doesn't work properly

Which Python version?
Do both the process and thread versions hang?

@McSinyx
Contributor Author

McSinyx commented Jun 5, 2020

@bmartinn, it's on Python 3 (the Python 2 version is just iter wrapped around the list for interface compatibility), and yes, both hang. I'm on GNU/Linux if that matters.



@contextmanager
def closing(pool):
Member


This isn't necessary, since the pool's __exit__ should handle closing and other details.

Contributor Author

@McSinyx McSinyx Jun 21, 2020


It's needed for imap* to start submitting tasks to the pool, for some reason. The non-lazy variant doesn't need it though. I think I'll add a comment explaining why it is needed. Edit: I have, but I think my pun made it unclear, so I'm gonna rephrase.
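For reference, a sketch of what such a helper might look like (close to, but not necessarily identical to, the code under review):

from contextlib import contextmanager


@contextmanager
def closing(pool):
    """Return a context manager making sure the pool closes properly."""
    try:
        yield pool
    finally:
        # For Pool.imap*, close and join are needed for the returned
        # iterator to begin yielding; terminate then cleans up the workers.
        pool.close()
        pool.join()
        pool.terminate()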

src/pip/_internal/utils/parallel.py
Comment on lines 68 to 73
"""Make an iterator applying func to each element in iterable.

This function is the sequential fallback when sem_open is unavailable.
"""
Member


I'd suggest dropping the docstrings in all the function definitions, and instead describe the functions in the module docstring (see other comment about trimming the API of this module to only 2 functions).

These internal-only function names are fairly self-explanatory, and I don't think the value-add of static analysis finding the relevant docstring is worth the duplication in this module, which makes it difficult to navigate and find relevant code.

Member


Here's a rough sample for what the docstring could be:

"""Helpers for parallelization of higher order functions.

This module provides two helper functions, with appropriate fallbacks on Py2
and on systems lacking support for synchronization mechanisms.

- ``map_multiprocess``
- ``map_multithread``

These helpers work like `map`, with 2 differences:

- They don't guarantee the order of processing of the elements of the iterable.
- The underlying process/thread pools chop the iterable into a number of chunks,
  and the (approximate) size of these chunks can be specified by passing
  an optional keyword-only argument ``chunksize`` (positive integer).
"""

Contributor Author

@McSinyx McSinyx Jun 21, 2020


I'd suggest dropping the docstrings in all the function definitions

I'm from the ask-docs camp (which is bound to K in my editor), and I'd like to support my kind! I do wonder, if we keep the function docstrings, whether we should add the docs you suggested to the module docstring or whether those should exist there exclusively.

Edit: I figure the intended API would be just either of the map_multi* functions, whose names say nothing about what happens to long input, nor about unorderedness.

Another edit: I'm going non-ReST for the module docstring.

@pradyunsg
Member

Windows isn't happy with these changes. :)

@McSinyx
Contributor Author

McSinyx commented Jun 22, 2020

Linting was also failing 😄; however, the test failing on py35 on Windows seems unrelated:

def test_rmtree_retries_for_3sec(tmpdir, monkeypatch):
    """
    Test pip._internal.utils.rmtree will retry failures for no more than 3 sec
    """
    monkeypatch.setattr(shutil, 'rmtree', Failer(duration=5).call)
    with pytest.raises(OSError):
        rmtree('foo')

Edit: but somehow it is related, hmmm...

Edit: this confuses me, the failing test on Windows was first on Python 3.5, then 2.7, then 3.6, all on i386. I think I made something flaky 😞

@McSinyx McSinyx closed this Jun 22, 2020
@McSinyx McSinyx reopened this Jun 22, 2020
@McSinyx
Contributor Author

McSinyx commented Jun 22, 2020

Could I please have your help on the failing test, @pfmoore? I can't wrap my head around its logic, and I don't know how the import mocks I introduce make the retry test flaky (my guess, since that's the only state changed by this PR).

@pfmoore
Member

pfmoore commented Jun 22, 2020

Sorry, no idea to be honest. The failure doesn't seem to be related to the changes in this PR at all...

Oh wait, you're doing some really nasty hacks with the import mechanism. I wonder whether it's an interaction between what you're doing and the pytest plugin that runs our tests in parallel?

@McSinyx
Contributor Author

McSinyx commented Jun 22, 2020

runs our tests in parallel

Oh no! I think I'll revert to the handling in d9d18ff, which was easier/less hacky to test.

@McSinyx
Contributor Author

McSinyx commented Jun 22, 2020

False alarm: that test is flaky. I'll revert back to ec6e31e then, sigh.

Edit: I think this is ready now.

Member

@pradyunsg pradyunsg left a comment


LGTM. There's still stuff that we might want to iterate on, but this is a good start. :)

@pradyunsg
Member

I'm gonna go ahead and merge this in, since there hasn't been any activity over the past 4 days here. We can iterate on this further based on inputs/learning.

_import = __import__


def reload_parallel():
Member


Ideally this should be called one last time after all the tests have run (maybe via an auto-use fixture at module scope) to make sure it is loaded without being affected by the tests' monkeypatching.
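One possible shape for that, assuming pytest and that reload_parallel() restores pip._internal.utils.parallel to its unpatched state (a sketch, not the final test code):

import pytest


@pytest.fixture(autouse=True, scope='module')
def reload_parallel_last():
    yield
    # Runs once after the last test in this module, so the parallel module
    # is re-imported without the monkeypatched __import__ in effect.
    reload_parallel()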

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2021
Labels
type: feature request (Request for a new feature)

Successfully merging this pull request may close these issues.

Multithreading and unsupported platforms
5 participants