Discussion: the urllib3 bleach-spike branch #1
I actually can't find the |
Sorry, it's in the root of the source tree in the bleach-spike branch: notes.org raw, notes.org formatted |
Ah, OK. I thought you had changed the default branch for your repo and that I was looking at |
Questions:
|
Currently I'm treating the backend API as an internal (non-public) interface, so I haven't worried too much about such refinements :-). (In fact ATM there isn't even any ABC created in the code, only in the notes.) Now, possibly it would make sense to make this a public API, at which point we would need to think about such things, but for now I don't care that much. The
Because I needed (It turns out that the way HTTP/1.1 works – and specifically given the way @Lukasa wants to handle it in v2 – any time you're sending you also want to be watching for an early response from the server, so
Sure, those are probably better, I'll rename. |
Oh, that's also a good idea :-). Done. |
Is it possible to pull together some sample code showing how we would select a backend and interact with it? |
Here's the tiny example on the home page, currently:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
'User-agent: *\nDisallow: /deny\n'
For purposes of the proof-of-concept, let's say we want it to instead look like:
>>> import urllib3
>>> from urllib3.backends.trio import TrioBackend
>>> http = urllib3.PoolManager(TrioBackend())
>>> r = await http.request('GET', 'http://httpbin.org/robots.txt') # <-- notice the await
>>> r.status
200
>>> r.data
'User-agent: *\nDisallow: /deny\n'
Oh, hmm, the
import trio
import urllib3
from urllib3.backends.trio import TrioBackend
async def main():
    http = urllib3.PoolManager(TrioBackend())
    r = await http.request('GET', 'http://httpbin.org/robots.txt', preload_content=False)
    print(r.status)  # prints "200"
    print(await r.read())  # prints "User-agent: *\nDisallow: /deny\n"
trio.run(main)
Twisted version:
# Cribbing from https://twistedmatrix.com/documents/current/core/howto/defer-intro.html#coroutines-with-async-await
import urllib3
from urllib3.backends.twisted import TwistedBackend
from twisted.internet import reactor
from twisted.internet.defer import ensureDeferred
async def main():
    http = urllib3.PoolManager(TwistedBackend())
    r = await http.request('GET', 'http://httpbin.org/robots.txt', preload_content=False)
    print(r.status)  # prints "200"
    print(await r.read())  # prints "User-agent: *\nDisallow: /deny\n"
    reactor.stop()
ensureDeferred(main())
reactor.run()
And for the sync backend:
import urllib3
from urllib3.backends.sync_backend import SyncBackend
async def main():
    http = urllib3.PoolManager(SyncBackend(100, 100))
    r = await http.request('GET', 'http://httpbin.org/robots.txt', preload_content=False)
    print(r.status)  # prints "200"
    print(await r.read())  # prints "User-agent: *\nDisallow: /deny\n"
def run_secretly_sync_async_fn(async_fn, *args):
    # Drive the coroutine by hand: with the sync backend it never actually
    # suspends, so a single send() should run it to completion.
    coro = async_fn(*args)
    try:
        coro.send(None)
    except StopIteration as exc:
        return exc.value
    else:
        raise RuntimeError("you lied, this async function is not secretly synchronous")
run_secretly_sync_async_fn(main) |
Is this what you had in mind regarding the bleaching script?
|
That's the basic idea, yeah, and probably fine for testing a proof of
concept. However, it is possible to preserve formatting with a bit more
cleverness, and without going all the way to 2to3. Here's the trick: the
tokenize module tells us the line and column for each token it finds. So
what one could do is to first use the tokenize module to find the actual
character spans for each token that should be replaced or deleted, and then
go back to the original formatted source string and make just those
changes. It's kind of fiddly, but not *too* bad I think. And at least it
only has to be implemented once :-).
…On Dec 27, 2017 07:00, "Quentin Pradet" ***@***.***> wrote:
Is this what you had in mind regarding the bleaching script? tokenize
does not preserve formatting, I guess lib2to3 would be needed for that.
import sys
from tokenize import tokenize, untokenize, ASYNC, AWAIT, NAME

ASYNC_TO_SYNC = {
    '__aenter__': '__enter__',
    '__aexit__': '__exit__',
    '__aiter__': '__iter__',
    '__anext__': '__next__',
    # '__await__': '__iter__' ?
}

if __name__ == '__main__':
    with open(sys.argv[1], 'rb') as f:
        result = []
        g = tokenize(f.readline)
        for toknum, tokval, _, _, _ in g:
            if toknum == NAME and tokval in ASYNC_TO_SYNC:
                result.append((toknum, ASYNC_TO_SYNC[tokval]))
            elif toknum not in [ASYNC, AWAIT]:
                result.append((toknum, tokval))
    print(untokenize(result).decode('utf-8'))
|
I see! https://gist.github.com/pquentin/188f139be4c1bd5731b54bb2f7c29e41 should be closer to what you are describing. I realize this is not needed for the proof of concept but it's a nice self-contained problem and it was fun to think about it. |
Note that gist puts all the tokens in the right position on the page, but it doesn't preserve comments or anything – it just fills in the empty space with space characters. It's possible to get even more fancy, by using the original source file to fill in all the whitespace/comments around the tokens. (One way to think of it: use |
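To make that span-based approach concrete, here is a rough sketch (an assumed illustration, not the project's script or the gist above): one pass over the tokens records the character spans to delete or rename, and a second pass splices those spans out of the original text, so surrounding comments and spacing survive untouched.
import sys
import tokenize

RENAMES = {
    '__aenter__': '__enter__',
    '__aexit__': '__exit__',
    '__aiter__': '__iter__',
    '__anext__': '__next__',
}

def bleach(path):
    # Pass 1: use tokenize only to find where the interesting tokens live.
    with open(path, 'rb') as f:
        tokens = list(tokenize.tokenize(f.readline))
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
    edits = []  # (row, start_col, end_col, replacement)
    # Depending on the Python version, async/await show up as ASYNC/AWAIT or
    # as NAME tokens, so match on the token string and accept either type.
    keyword_types = {tokenize.NAME,
                     getattr(tokenize, 'ASYNC', -1),
                     getattr(tokenize, 'AWAIT', -1)}
    for tok in tokens:
        if tok.type in keyword_types and tok.string in ('async', 'await'):
            row, col = tok.start
            end = tok.end[1]
            if lines[row - 1][end:end + 1] == ' ':
                end += 1  # also eat one trailing space: "async def" -> "def"
            edits.append((row, col, end, ''))
        elif tok.type == tokenize.NAME and tok.string in RENAMES:
            edits.append((tok.start[0], tok.start[1], tok.end[1], RENAMES[tok.string]))
    # Pass 2: splice the edits into the original source, right-to-left so that
    # earlier column numbers on the same line stay valid. Comments, blank
    # lines, and formatting are never touched.
    for row, start, end, repl in sorted(edits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:start] + repl + line[end:]
    return ''.join(lines)

if __name__ == '__main__':
    sys.stdout.write(bleach(sys.argv[1]))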
@haikuginger Not sure if you've seen this, but as the new urllib3 maintainer I figured I should give you a heads up :-). This is an experiment in porting urllib3 to support both sync and async operation within a single codebase using h11, and it's just reached the point where it managed to complete its first async http request. |
There's some more discussion of whether close should be synchronous or asynchronous in #4 |
@pquentin (and anyone else who's interested): I've been thinking about the question of what a real deployment of this would look like, and in particular how we want to expose the sync + async APIs. On Python 2, of course, we want to just export a sync API, and that has to be done by "bleaching" the async codebase and shipping that. On Python 3, it's a little more subtle: we want to provide the async API, but we also want to provide a sync API for backwards compatibility and because, well, it's useful. Here we have two options: we could wrap all of the async API in a little async→sync shim like my Going through and shimming the entire public API sounds like a fair amount of work: shimming a function is easy, shimming a method requires creating a whole proxy class to wrap around the original class and reproduce its full API, and then there are more complicated cases like the upload code that will expect an async iterator to be passed. AFAICT this is all doable, but Py2 means we have to get the bleaching trick working, and then once we have that it's not clear that implementing shimming as well will give us any additional benefits to pay back all the additional work. And since by definition they produce the same public API, we can always add shimming later if we decide it's worth it. So for now I think we should forget about shimming, and instead figure out how to ship a bleached+unbleached library on Py3. Here's one way to do it:
Anyway, the overall strategy should be clear. And it ends up giving us a single package that can be distributed as a py2 + py3 compatible wheel, provides a backwards-compatible urllib3 sync API everywhere, and provides a full async API on pythons that can handle it. Does that make sense? Am I missing anything? |
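For a sense of what the (rejected) shimming option would have involved, here is a hypothetical sketch (names are made up; this is not what the branch does): a function shim is just the drive-the-coroutine trick from run_secretly_sync_async_fn above, while shimming a method means wrapping the whole object in a proxy class.
import functools

def sync_shim(async_fn):
    # Wrap an async callable whose awaits never actually suspend, so a single
    # send() runs it to completion -- same trick as run_secretly_sync_async_fn.
    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        coro = async_fn(*args, **kwargs)
        try:
            coro.send(None)
        except StopIteration as exc:
            return exc.value
        raise RuntimeError("async function tried to suspend")
    return wrapper

class SyncResponseProxy:
    # Shimming a method requires a proxy class: every async method on the
    # wrapped object needs a sync counterpart here, plus pass-throughs for
    # plain attributes. (The wrapped response type is hypothetical.)
    def __init__(self, async_response):
        self._resp = async_response

    @property
    def status(self):
        return self._resp.status

    def read(self, *args, **kwargs):
        return sync_shim(self._resp.read)(*args, **kwargs)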
👍 on the overall strategy. To make this less confusing to contributors, maybe we can actually leave the async code in the root of the package? |
Hmm, can you elaborate on what you're thinking of? Like we'd have |
I had not realized that we would have to remove the async version. Moving more files around than needed is bad, please ignore my suggestion. I have one question left regarding your suggestion, though. We not only want to preserve With the way Python works, as far as I can tell, when you type
Did I miss anything? |
Oh ick, that's unfortunate. Good catch though. And I think you're right that the way I guess one option would be to say hey, as part of v2 we're hiding all that internal stuff! But... I have no idea if the urllib3 maintainers would go for that. At least some breakage is unavoidable – e.g. at the top of that page we have documented that I guess in general this is going to have to be a long-term discussion, with a wide range of possible outcomes including "change nothing", "change some things", "okay this is Off the top of my head, I see three options that could qualify as "plausible" (maybe there are more?):
The last actually seems simplest for right now, though we know we might have to change it in the future... |
First, I would like to say that I'm really happy that we are discussing this, it means we have made meaningful progress. Let's not forget that! 😀
To avoid a Python 3-like situation, I would like to avoid breaking anything that we don't have to break, and since we're not renaming
Regarding exceptions, I do believe it's important to have only one class for each exception, so that an except clause with the "async version" of say,
The first users of those documented internal APIs are tests, so we are going to feel the pain of migrating first hand! It looks like putting the sync version at the root of the package means most synchronous tests will continue working as before! If that's true, it looks like it's going to be so much easier that I would not be able to justify a cleaner approach.
But of course, the most important thing is to choose an approach, and be prepared to change when we have more data. |
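A tiny illustration of why the single-exception-class point matters (class names here are stand-ins, not urllib3's real ones): if bleaching produced a second copy of an exception class, handlers written against one copy would not catch the other.
class HTTPErrorAsync(Exception):  # the class as defined in the async code
    pass

class HTTPErrorSync(Exception):   # the same definition duplicated by bleaching
    pass

try:
    raise HTTPErrorAsync("boom")
except HTTPErrorSync:
    print("caught")               # never reached: they are different classes
except HTTPErrorAsync:
    print("not caught by the sync clause")  # this is what actually happens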
Hmm, so we'd have a file at
_BLEACHED = False
if _BLEACHED:
    class HTTPError(Exception):
        ...
else:
    # Unbleached
    from ..exceptions import *
and then when we build, that copies the file to
And I suspect we'll eventually end up hacking at the tests quite a bit, since we'll want to migrate them into testing both the sync and async versions. But okay, yeah, backcompat is pretty hard to argue with.
I just had another idea that I think maybe is the best; let's see what you think. What if we (a) put the async code in
# urllib3/connectionpool.py
from ._sync.connectionpool import ConnectionPool, HTTPConnectionPool, HTTPSConnectionPool
This is clunky and low tech, but there are a finite number of documented submodules, and each one contains a finite list of documented exports, so it can be done. It's also incredibly obvious and boring (which is good), and it gives a very explicit dividing line between public and private API (which seems like an important step for making urllib3 more maintainable going forward). |
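For illustration, the resulting tree might look roughly like this (only _sync and the re-exporting top-level modules come from the comment above; the directory name for the async source is an assumption):
urllib3/
    __init__.py          # public API
    connectionpool.py    # from ._sync.connectionpool import ...
    exceptions.py        # from ._sync.exceptions import ...
    _async/              # hypothetical name: the code we actually edit
    _sync/               # generated from the async code by the bleaching script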
@njsmith This sounds great! 👍 Let's try that. |
Hmm. Flask used to include code so that flask.ext.* is actually loaded from |
Right, it's possible to play games with |
I just remembered a potentially compelling reason why we might want to make shutdown async (so it'd be |
GOAWAY helps clients when the server decides it needs to gracefully shut down the connection (server restart, load balancing). The client can then transparently open another socket and use that for new traffic, thus avoiding the latency associated with looking up the name and connecting. GOAWAY helps the server because with HTTP/2 servers can spontaneously send additional resources which the client is going to need (e.g. style sheets or fonts). Not doing that on a connection that will go away ASAP is a good idea. In general, sometimes the underlying connection is more complex than a simple socket that can be closed indiscriminately. It's a good idea to always support the async version, rather than having to change client code when the server requires more complex handling. |
@smurfix if you're interested in this question you might want to check out #4 as well for context – it's a remarkably nuanced question. My bias was also initially for async close, but then I switched when we realized how many complications it would make in the API (e.g. there's a public class that implements the mapping interface and calls |
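To make that complication concrete, a toy version (not urllib3's actual class) of the kind of code that gets awkward if close() becomes a coroutine:
class ConnectionRegistry(dict):
    # A container that closes evicted connections from ordinary mapping
    # methods. __delitem__ cannot be declared async, so it has no way to
    # await an asynchronous close().
    def __delitem__(self, key):
        conn = self[key]
        super().__delitem__(key)
        conn.close()  # only works while close() stays synchronous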
@njsmith I'm sorry that you essentially had to say this twice. I don't think we have to look at the content of the file: the tokenize module does preserve comments (in the COMMENT token) and backslashes (implicitly by omitting the NEWLINE token). My script needs to be simplified (!) to handle a few edge cases like this, though. I also don't think we want to look at the content of the file. Sure, the bleaching script is only going to remove characters, so an in-place O(n) algorithm exists, but since we're dealing with Unicode and Python lists, it's not clear that this algorithm is any better than just using
If I'm not missing anything and 1/ we don't have to do it and 2/ we do not want to do it, maybe we won't do it! :) |
Oh, sorry, I just figured we were ignoring the issue until later, since it's not really crucial :-). I forgot that tokenize keeps comments though! That does potentially simplify things. (Maybe it would be interesting as a test to run the script over a bunch of code that doesn't use async/await, like the 3.4 stdlib or something, and check whether it preserves the text? I don't know that we necessarily have to preserve the text exactly, but it'd at least be nice to know what any changes look like.) I'm not worried about speed at all, just correctness. |
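One way to run that check, assuming a bleach() helper like the sketch earlier that maps a file path to bleached source:
import pathlib
import sys

def check_no_op(paths, bleach):
    # Bleach files that contain no async/await at all; ideally every one
    # comes back byte-for-byte identical to the original.
    for path in paths:
        if bleach(path) != pathlib.Path(path).read_text(encoding="utf-8"):
            print("changed:", path, file=sys.stderr)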
Caught up on the past few months of conversation. I have no input on direction at this time, but am making myself available for a few hours a week if you need someone to just get stuff done. |
@thinkjson Cool! I'm currently waiting on direction to know what the strategy is to get the 215 commits from the urllib3 master branch. But I think the next step is to work on tests, any improvement would be welcome here. |
I think for the master branch, we'll probably just need to go through them one at a time at some point and forward port each one. It'll be a bit tedious, but it's fewer PRs than it looks, and most of them looked pretty simple. But before we touch that I think we need to get the test suite and more functionality running first, because otherwise we won't be able to tell if our ports worked. On the test suite, I think the next goal is to get it running in some form, any form, from start to finish, under pytest. Like, we should define
broken = pytest.mark.xfail
extra_broken = pytest.mark.skipif(True, reason="freezes")
and then start hacking at the test suite adding |
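For example (test names here are made up), applying the markers looks like:
import pytest

broken = pytest.mark.xfail
extra_broken = pytest.mark.skipif(True, reason="freezes")

@broken
def test_not_ported_yet():        # expected to fail for now, still runs
    assert False

@extra_broken
def test_something_that_hangs():  # skipped outright so the freeze never happens
    while True:
        pass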
I guess I should post an update, since the last comment here is way out of date. We are now up to date with urllib3 master, and you can run the test suite by doing |
Playing with this today — very impressed! |
FYI pip is installing pluggy 0.7.1. It's resolving the
Workaround:
diff --git a/tox.ini b/tox.ini
index b98e128..2c3f040 100644
--- a/tox.ini
+++ b/tox.ini
@@ -2,7 +2,9 @@
envlist = flake8-py3, py27, py34, py35, py36, py37, pypy, pypy3
[testenv]
-deps= -r{toxinidir}/dev-requirements.txt
+deps=
+ pluggy==0.6.0
+ -r{toxinidir}/dev-requirements.txt
commands=
# Print out the python version and bitness
pip --version |
Edit: current status
There's a lot of discussion and updates below, making this issue hard to read if you just want to know what the current status is. So tl;dr:
This is a project to add support to urllib3 (and ultimately the packages that depend on it like requests, boto, etc.) for async mode with multiple IO library backends (currently trio + twisted, and it's easy to add more), while keeping the original sync API mostly intact. For a big picture overview and demo see: urllib3/urllib3#1323
We're definitely looking for help to move this forward. If you want to help see below for some summary of what needs to be done, and you can post in this thread or join us in our chat if you want more information.
Things that are working:
- Basic HTTP requests using sync mode, trio mode, twisted mode (again, see RFC: what do you think of this branch adding async support to urllib3? urllib3/urllib3#1323 for a demo, or check out the demo/ directory)
- The test suite runs, though lots of tests are temporarily marked xfail or skip.

Things that need to be done (incomplete list)
- Get the tests running on travis, so we can make sure new PRs aren't breaking things
- Review all the tests marked xfail or skip and one-by-one make them pass (or delete them if they no longer apply). This will require fixing a bunch of things (timeouts, https, etc.)
- Make it so trio and twisted aren't import-time requirements.
- Take the syncifier code in setup.py and factor it out into its own project. Also it would be nice to get it running under python 2.
- Get the tests running on python 2 (they should work if you can get things set up, but we currently can't build on python 2 because of the previous bullet point, so testing is awkward)
- Get tox working (for now, use runtests.py instead)
- Maybe make close methods async? (see Discussion: the urllib3 bleach-spike branch #1 (comment) and replies and links therein)
- Currently we're only testing the synchronous mode; we need to add tests for the different async backends. This probably requires making parts of the test suite async (with some kind of parametrization to test against multiple backends) and then using the syncifier script to generate a synchronous version of the tests to test the synchronous mode.
- Add tests for new features, in particular servers that send an early response.
- Eventually: we'll need to update the docs.
- Perhaps set up some kind of more useful way of tracking this todo list
Getting started
So you want to help? Awesome!
- Check out the bleach-spike branch from the python-trio/urllib3 repo
- You should be able to run the tests with python runtests.py (this currently requires python 3.5+)
- You should see lots and lots of output, and eventually get a message from pytest that looks like "683 passed, 241 skipped, 31 xfailed, 20 warnings".
- Hack on the code to make the "passed" number go up, and the other numbers go down. Or start working on any of the other suggestions up above.
The below is the original post that started off this thread.
Original post
(See urllib3/urllib3#1228 for context)
The bleach-spike branch in this repo is based off of @Lukasa's v2 branch in the main urllib3 repo. The idea is to experiment with converting urllib3 to support three backends/public APIs: plain sockets (like it does now, using select/poll internally), trio (with an async-based public API), twisted (with an async-based public API). [These were chosen as a representative sampling of approaches, i.e. if it works for these it'll probably also work for asyncio/tornado/curio/..., modulo specific limitations like asyncio's lack of starttls.] I'm putting some notes here because @merrellb asked if he could help, so :-).
Here's the diff link: urllib3/urllib3@v2...python-trio:bleach-spike
The overall goal here is not to immediately create a polished product, but just to test the viability of this approach, so a bit of sloppiness is OK; the goal is to get a proof-of-concept as efficiently as possible.
The basic idea is: we define a common "backend API" that provides a common set of socket operations, and acts as the abstraction boundary between urllib3 and our different network backends.
notes.org has some details on what my current draft of the API looks like. urllib3/backends/ has the actual backend implementations. The backend uses async/await, but the plain socket implementation of it never actually takes advantage of this -- its methods never yield. The trio and twisted backends of course assume that they're running in a trio or twisted async context.
Done:
- sync_connection.py in terms of the backend API, but again haven't actually tried running it. I think it's much more readable and reliable (once we fix the silly bugs) than the version in v2. (The filename is now something of a misnomer, but I figured I'd leave it alone for now to make the diff more readable. See above re: proof-of-concept.)
Todo:
- async/await markers out from sync_connection.py out to all their callers
- sync_connection.py that I didn't really understand, so I put in some best-guesses but it's probably wrong. Probably there are other interactions with the rest of the world that aren't quite settled as well.
- https exercises unique paths
The trio code currently assumes the existence of trio.open_tcp_stream, which, uh, doesn't exist yet. I realized today though that contrary to previous reports we don't need to robustify SSLStream to make this work (i.e., python-trio/trio#198 is not actually a blocker). (The reason is that I changed the backend interface here a little bit in a way that both simplified the code here and also removed this problem. Specifically, before I was trying to use read_and_write_for_a_while to send the request and then switch to receive_some to read the response headers; now I just use a single call to read_and_write_for_a_while to do both.)
Further items that maybe aren't on the critical path:
- tokenize module to automatically transform the library back into a synchronous library by deleting async and await tokens, plus replacing __aiter__ → __iter__, __anext__ → __next__
- XX in the source
CC: @merrellb