Multithreading and unsupported platforms #8169

Closed
McSinyx opened this issue Apr 29, 2020 · 6 comments · Fixed by #8320
Labels
state: needs discussion This needs some more discussion type: maintenance Related to Development and Maintenance Processes

Comments

@McSinyx
Contributor

McSinyx commented Apr 29, 2020

This is opened as the continuation of GH-7962, GH-8161, GH-8162, GH-3981 and GH-4654 and is one of the approaches to solve GH-825.

Why multithreading?

One common inconvenience with using pip is the networking delay, since most package indices are not really fast[citation needed] and during package management pip needs to fetch many things (the package list, the packages themselves, etc.). Parallelization is one obvious solution to tackle this, and I hope it will be the cheaper one, hence this issue is opened to ensure that the implementation will not be labor-intensive.

Until next year when Python 2 support is dropped, there are two options: multithreading and multiprocessing. While the latter is safer, (1) not every platform has multiple CPU cores and (2) the modified code would need to undergo a huge refactoring to give each core the data it needs. So we are left with multithreading. Python 3's asyncio is not an immediate solution, however (plus it would also require making many existing routines awaitable).

What is the problem with multithreading?

Putting thread-safety aside (not because it's not a problem, but rather because I think everyone knows how problematic it is), the most obvious solution provided by Python, multiprocessing.dummy.Pool, requires sem_open (bpo-3770); without it, an ImportError seems to be raised during initialization of the pool's attributes. Since sem_open is to be provided by the operating system, this raises two questions: whether multiprocessing.dummy is supported on the platforms that pip cares to support, and whether (the more generic?) threading suffers from the same issue if we implement the Pool ourselves. How about concurrent.futures (GH-3981)? Would it be worth doing, from the developers' perspective as well as that of our users, if things go wrong on their platform?

If we decide to do it anyway, how?

From GH-8162, IMHO it is safe to assume that (this is a really dangerous thing to say 😞) we can fall back to map if multiprocessing.dummy.Pool can't be used for lack of sem_open. If this works, personally I suggest declaring a higher-order function to reuse in other places, namely for parallel downloading of packages (GH-825). Still under the assumption that this is correct, we can easily mock the failing behavior for testing. However, with my modest experience in threading and the overwhelming responsibility of not breaking thousands[citation needed, could be millions] of people's workflows, please do not take my word for it and kindly share your thoughts on this particular matter.
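
For illustration only, the fallback idea could be sketched roughly as follows. The function name `map_multithread` and its signature are hypothetical, not pip's actual code, and the sketch assumes that a missing sem_open surfaces as an ImportError either when importing multiprocessing.dummy or when constructing the pool, as described above.

```python
from contextlib import closing


def map_multithread(func, iterable, chunksize=1):
    """Apply func to every item, using threads when the platform allows.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch only.
    """
    try:
        # Assumption: on platforms lacking sem_open, either this import
        # or the Pool() call raises ImportError (see bpo-3770).
        from multiprocessing.dummy import Pool
        pool = Pool()
    except ImportError:
        # Sequential fallback: same results in the same order, no threads.
        return list(map(func, iterable))
    # closing() calls pool.close() once the mapping has finished.
    with closing(pool):
        return pool.map(func, iterable, chunksize)
```

Because both branches return a list in input order, callers do not need to know which one ran; a test could force the fallback by making the import fail.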

@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Apr 29, 2020
@McSinyx McSinyx changed the title Multithreading and what to do when the platform does not support it Multithreading and unsupported platforms Apr 29, 2020
@pradyunsg pradyunsg added state: needs discussion This needs some more discussion type: maintenance Related to Development and Maintenance Processes labels Apr 29, 2020
@triage-new-issues triage-new-issues bot removed S: needs triage Issues/PRs that need to be triaged labels Apr 29, 2020
@pfmoore
Member

pfmoore commented Apr 29, 2020

I think the broader question here is whether pip should support platforms that don't provide usable threading support. That's basically what @McSinyx said, but summarised down to the bare essential point.

From the Python documentation, the threading module is required in Python 3.7+ (before that it was optional), and multiprocessing.dummy is documented as just being a wrapper around threading. And concurrent.futures is available from Python 3.2, and I believe there's a backport as well.

We currently claim to support Python 3.5+ (I'm going to ignore Python 2, as we'll be dropping support for that in 2021, and it's not the real issue anyway). So on that basis, we need to cover platforms without threading[1], at least until we drop Python 3.5 and 3.6 support.

Personally, I'd suggest that what we do is have a compatibility module that implements whatever concurrency primitives we want, and has fallbacks for non-threaded platforms. We can then unit-test those wrappers to ensure that we behave the same with or without threading, and then we use the wrappers wherever we need them in the rest of the code. Once we drop support for platforms without threading, we can decide whether to keep the wrappers or use the core features directly.
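
A minimal sketch of what such a compatibility wrapper might look like, assuming concurrent.futures is the primitive chosen. `SerialExecutor` and `make_executor` are hypothetical names invented for this example, and only the `map` part of the executor interface is covered.

```python
try:
    import threading  # optional before Python 3.7
except ImportError:
    threading = None


class SerialExecutor(object):
    """Hypothetical fallback exposing only the executor features used here."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    def map(self, func, *iterables):
        # Built-in map: same order of results, evaluated in the caller's thread.
        return map(func, *iterables)


def make_executor(max_workers=4, use_threads=True):
    """Return a thread-backed executor if possible, else the serial fallback.

    use_threads=False lets a unit test exercise the fallback on any platform.
    """
    if use_threads and threading is not None:
        try:
            from concurrent.futures import ThreadPoolExecutor
            return ThreadPoolExecutor(max_workers=max_workers)
        except ImportError:
            pass
    return SerialExecutor()
```

A unit test can then run the same workload through `make_executor()` and `make_executor(use_threads=False)` and compare the results, which is one way to check that behaviour matches with and without threading.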

[1] #8161 was actually reported on Python 3.8.2, on Android Termux. If we take the Python docs seriously, that platform is broken by not providing a working threading implementation. I don't know how we want to deal with that. Python on mobile is an important enough area that I can see core Python being sympathetic to the idea of not being too strict here. Luckily, the point is irrelevant for now if we are going to support platforms without threading anyway.

@pradyunsg
Member

pradyunsg commented Apr 29, 2020

Python 3.5

We'll likely be dropping Python 3.5 at the same time as Python 2.7 btw -- since Python 3.5 goes EoL in August / September 2020.

@uranusjr
Member

uranusjr commented Apr 29, 2020

One potential consideration after 2021 is asyncio (if pip ever wants to use it). A lot of the async stuff uses threading as a backend when whatever it wants to do doesn’t have OS-level event loop support.

@dstufft
Member

dstufft commented Apr 29, 2020

I do not think it is important to support non-threading Pythons. We don't have to support it just because it is a possible option.

@dstufft
Member

dstufft commented Apr 29, 2020

That being said, I think longer term we are ideally using some form of async code instead of threading or multiprocessing directly (I would love to use trio as it is much better than asyncio imo, but it has some C stuff so it would be a much harder change).

@bmartinn

bmartinn commented May 3, 2020

Hi @McSinyx, I was not aware of this thread and just opened #8187.
It has a reference implementation of parallelization of the install process.
Bottom line: a 1.9x speed-up :)

It should support Python 2.7 and Python 3.5 because it relies on ThreadPool and Pool, both of which are available on 2.7/3.5.
Notice that in the reference implementation the multi-threading support is turned off by default; only adding --parallel will actually use the ThreadPool/Pool. This way we still support platforms with limited multiprocessing capabilities.
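
As a rough illustration of the opt-in pattern (not the code from #8187), the sketch below keeps the sequential path as the default and only imports ThreadPool when the flag is given. Apart from the `--parallel` flag name, everything here is hypothetical, including the use of argparse, `build_parser`, `run_jobs`, and the `workers` parameter.

```python
import argparse


def build_parser():
    # Hypothetical CLI for this sketch; only the flag name comes from the thread.
    parser = argparse.ArgumentParser(prog="installer-sketch")
    parser.add_argument("--parallel", action="store_true",
                        help="run jobs in a thread pool (off by default)")
    return parser


def run_jobs(job, items, parallel=False, workers=4):
    """Run job on every item, sequentially unless parallel is requested."""
    if not parallel:
        # Default path: never touches multiprocessing at all.
        return [job(item) for item in items]
    # Imported lazily so that platforms with limited multiprocessing
    # support are unaffected unless the user opts in.
    from multiprocessing.pool import ThreadPool
    pool = ThreadPool(workers)
    try:
        return pool.map(job, items)
    finally:
        pool.close()
        pool.join()
```

For example, `run_jobs(str.upper, ["a", "b"], parallel=build_parser().parse_args(["--parallel"]).parallel)` takes the thread pool path, while omitting the flag stays sequential.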

Notice how we have solved the multi-instance progress bar download issue; I would love some feedback on our solution, and any suggestions are welcome.

6 participants