Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding GZip support to urllib3 #704

Merged
merged 8 commits into from
Mar 13, 2018
Merged

Adding GZip support to urllib3 #704

merged 8 commits into from
Mar 13, 2018

Conversation

robgil
Copy link
Contributor

@robgil robgil commented Jan 8, 2018

Adds new urllib3 class GzipEnabledConnection for handling gzip compression.

On bulk loads, this increased performance from 3000 doc/s to ~25k docs/s on my connection.

@robgil robgil requested a review from honzakral January 8, 2018 17:30
@robgil robgil requested a review from fxdgear January 9, 2018 15:43
@fxdgear
Copy link
Contributor

fxdgear commented Jan 9, 2018

@robgil great thanks for the PR. I'll take a look. 👍

@fxdgear
Copy link
Contributor

fxdgear commented Jan 9, 2018

can you add a test? Something simple to make sure that it's grabbing the right Class?

@robgil
Copy link
Contributor Author

robgil commented Jan 9, 2018

I'm trying to make this compatible with python2.7, but Elastic Cloud is barfing on this with 400 errors during bulk load. gzip.compress() seems to be the only one I can get to work.

    def perform_request(self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None):
        gzip_body = BytesIO()
        with GzipFile(fileobj=gzip_body, mode='wb', compresslevel=3) as f:
            f.write(body)
        headers.update(urllib3.make_headers(accept_encoding=True))
        headers.update({'Content-Encoding': 'gzip'})
        conn = super().perform_request(method=method, url=url, params=params,
                body=gzip_body, timeout=timeout, ignore=(), headers=headers)
        return conn

When I run the unit tests against a local ES, I don't get any errors.

Copy link
Contributor

@honzakral honzakral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments in the code, but primarily I still have 2 unanswered questions:

  1. why do we want this, what benefit does it provide? Do we have any numbers that prove that this can lead to (significant) performance improvements?
  2. if answer to 1 is yes, do we then want this to be a separate class or just a (global) option? Is the connection class the best way to do this? Seems like just passing in headers={"content-encoding": '"gzip"} and a custom serializer that will output compressed bytes, completely bypassing the need for this to be part of the library.

Also what is missing from the code (and primarily, tests) is logging and error handling - both of these things now expect the body to be in json (see https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/connection/base.py#L52 for example). Maybe we need to rethink that logic too, if answer to 1 is a resounding yes.

"""
def perform_request(self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None):
headers.update(urllib3.make_headers(accept_encoding=True))
headers.update({'Content-Encoding': 'gzip'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no need to update these headers for every request, those should instead be set in self.headers in __init__ and then perform_request should just do gzip.compress.

@@ -8,8 +8,13 @@
from urlparse import urlparse
from itertools import imap as map
from Queue import Queue
StringIO = BytesIO = StringIO.StringIO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see StringIO used anywhere in the codebase, is there a reason those are included here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see note above about 2.7 compatibility. I was unable to get GzipFile to work with Elastic Cloud, but it works ok with a local install.

@@ -70,6 +70,62 @@ def test_ssl_context_and_depreicated_values(self):
self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ca_certs="/some/path/to/cert.crt")
self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ssl_version=ssl.PROTOCOL_SSLv23)

class TestGzipEnabledConnection(TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are only testing logic from the Urllib3HttpConnection, which is already tested and there is no need to retest it. What we need instead are tests for the gzip specific part - gzip accept headers and the fact that body is correctly compressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each of the tests, while redundant, they use the GzipEnabledConnection and tests all the same stuff but with the gzip class.

@robgil
Copy link
Contributor Author

robgil commented Jan 10, 2018

@honzakral See first comment regarding performance. You'd need to test this over a resource constrained network to see the performance improvement. All of the tests assume you're running a local ES server.

@honzakral
Copy link
Contributor

@robgil you are right, I completely missed the number in the original PR description, my bad!

In that case this makes sense. I think we'd want to have an optional flag which will set the correct headers in __init__ and then compress the body in perform_request of the original class (no subclassing, just add a conditional compression). Then we won't have to deal with the error handling or anything like that as it will remain the same.

Thanks again for raising this and sorry for the confusion from my part!

@robgil
Copy link
Contributor Author

robgil commented Feb 27, 2018

@honzakral @fxdgear re-worked and added the http_compress option. Any other thoughts on testing? That test I added is just a placeholder. I didn't notice a general "request" test that I could validate the headers and response compression. Any advice on unit testing would be great. I don't know whether this should be a mock or a real request against a local ES.

@fxdgear
Copy link
Contributor

fxdgear commented Mar 4, 2018

@robgil sorry to do this to you but can you please rebase?

I wanted to get the SSL fix in first

@fxdgear
Copy link
Contributor

fxdgear commented Mar 4, 2018

Also, can you please add to the docs

  1. how to use gzip
  2. when you'd want to use gzip vs not using gzip?

Thanks

Copy link
Contributor

@fxdgear fxdgear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rebase against master and also, can you please add to the docshow to use gzipwhen you'd want to use gzip vs not using gzip?

Thanks

@robgil robgil force-pushed the gzip branch 2 times, most recently from bfd661c to 3b03d64 Compare March 13, 2018 13:30
Copy link
Contributor

@honzakral honzakral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank for the work, Rob! I added a few nitpicky comments.

@@ -152,6 +155,10 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

request_headers = self.headers
if headers:
if self.http_compress == True:
headers.update(urllib3.make_headers(accept_encoding=True))
headers.update({'Content-Encoding': 'gzip'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the headers should be modified in __init__ so it only happens once at initializatin and not with every single request, see where self.headers are created. Also please use lowercase for content-encoding to be consistent.

@@ -32,6 +32,10 @@ def test_ssl_context(self):
)
self.assertTrue(con.use_ssl)

def test_http_compression(self):
con = Urllib3HttpConnection(http_compress=True)
self.assertTrue(con.http_compress)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also verify that con.headers are set properly, once the headers are set in __init__.

@@ -152,6 +159,8 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

request_headers = self.headers
if headers:
if self.http_compress == True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be outside of the if headers block, otherwise it will only be used if custome headers are specified.

if should be:

if self.http_compress and body:
    body = gzip.compress(body)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derp! Updating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good eye @honzakral

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants