Adding GZip support to urllib3 #704

robgil · 2018-01-08T17:29:56Z

Adds new urllib3 class GzipEnabledConnection for handling gzip compression.

On bulk loads, this increased performance from 3000 docs/s to ~25k docs/s on my connection.

fxdgear · 2018-01-09T17:20:27Z

@robgil great thanks for the PR. I'll take a look. 👍

fxdgear · 2018-01-09T17:22:17Z

can you add a test? Something simple to make sure that it's grabbing the right Class?

robgil · 2018-01-09T19:15:55Z

I'm trying to make this compatible with python2.7, but Elastic Cloud is barfing on this with 400 errors during bulk load. gzip.compress() seems to be the only one I can get to work.

    def perform_request(self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None):
        gzip_body = BytesIO()
        with GzipFile(fileobj=gzip_body, mode='wb', compresslevel=3) as f:
            f.write(body)
        headers.update(urllib3.make_headers(accept_encoding=True))
        headers.update({'Content-Encoding': 'gzip'})
        conn = super().perform_request(method=method, url=url, params=params,
                body=gzip_body, timeout=timeout, ignore=(), headers=headers)
        return conn

When I run the unit tests against a local ES, I don't get any errors.
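
For reference, a minimal sketch of a `GzipFile`-based helper that should run on both Python 2.7 and Python 3. The name `gzip_bytes` is hypothetical. It returns the buffer's contents via `getvalue()` instead of passing the `BytesIO` object along as the body; whether that difference explains the 400 responses from Elastic Cloud is not established in this thread.

```python
import gzip
from io import BytesIO


def gzip_bytes(body, compresslevel=3):
    """Compress a bytes payload with GzipFile and return the compressed bytes."""
    buf = BytesIO()
    # GzipFile writes the gzip trailer when it is closed, so the buffer is
    # only read after the with-block has exited.
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=compresslevel) as f:
        f.write(body)
    # getvalue() yields the raw compressed bytes held by the buffer.
    return buf.getvalue()
```

The result can be checked by decompressing it and comparing against the original payload.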

honzakral

I left some comments in the code, but primarily I still have 2 unanswered questions:

1. Why do we want this, what benefit does it provide? Do we have any numbers that prove that this can lead to (significant) performance improvements?
2. If the answer to 1 is yes, do we then want this to be a separate class or just a (global) option? Is the connection class the best way to do this? It seems that just passing in headers={"content-encoding": "gzip"} together with a custom serializer that outputs compressed bytes would completely bypass the need for this to be part of the library.

Also what is missing from the code (and primarily, tests) is logging and error handling - both of these things now expect the body to be in json (see https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/connection/base.py#L52 for example). Maybe we need to rethink that logic too, if answer to 1 is a resounding yes.
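
The serializer alternative raised in the second question could look roughly like the sketch below. `GzipJSONSerializer` is a hypothetical name, and the `dumps`/`loads` methods and `mimetype` attribute are assumptions about what a serializer needs to provide. It uses `gzip.compress`, so it is Python 3 only, and it does not address the logging and error-handling concern, since the body it produces is no longer JSON text.

```python
import gzip
import json


class GzipJSONSerializer(object):
    """Hypothetical serializer that emits gzip-compressed JSON bytes."""

    mimetype = 'application/json'

    def dumps(self, data):
        # Strings and bytes are assumed to be already serialized
        # (for example, a newline-delimited bulk body).
        if not isinstance(data, (str, bytes)):
            data = json.dumps(data)
        if isinstance(data, str):
            data = data.encode('utf-8')
        return gzip.compress(data)

    def loads(self, s):
        # Responses are assumed to arrive already decompressed.
        return json.loads(s)
```

The caller would still have to send a `content-encoding: gzip` header with every request for the server to accept the compressed body.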

honzakral · 2018-01-10T08:58:53Z

elasticsearch/connection/http_urllib3.py

+    """
+    def perform_request(self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None):
+        headers.update(urllib3.make_headers(accept_encoding=True))
+        headers.update({'Content-Encoding': 'gzip'})


there is no need to update these headers for every request, those should instead be set in self.headers in __init__ and then perform_request should just do gzip.compress.

honzakral · 2018-01-10T08:59:59Z

elasticsearch/compat.py

@@ -8,8 +8,13 @@
    from urlparse import  urlparse
    from itertools import imap as map
    from Queue import Queue
+    StringIO = BytesIO = StringIO.StringIO


I don't see StringIO used anywhere in the codebase, is there a reason those are included here?

Yes, see note above about 2.7 compatibility. I was unable to get GzipFile to work with Elastic Cloud, but it works ok with a local install.

honzakral · 2018-01-10T09:02:11Z

test_elasticsearch/test_connection.py

@@ -70,6 +70,62 @@ def test_ssl_context_and_depreicated_values(self):
        self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ca_certs="/some/path/to/cert.crt")
        self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ssl_version=ssl.PROTOCOL_SSLv23)

+class TestGzipEnabledConnection(TestCase):


these tests are only testing logic from the Urllib3HttpConnection, which is already tested and there is no need to retest it. What we need instead are tests for the gzip specific part - gzip accept headers and the fact that body is correctly compressed.

While redundant, each of the tests uses the GzipEnabledConnection and tests all the same stuff, but with the gzip class.

robgil · 2018-01-10T13:16:20Z

@honzakral See first comment regarding performance. You'd need to test this over a resource constrained network to see the performance improvement. All of the tests assume you're running a local ES server.

honzakral · 2018-01-10T13:43:19Z

@robgil you are right, I completely missed the number in the original PR description, my bad!

In that case this makes sense. I think we'd want to have an optional flag which will set the correct headers in __init__ and then compress the body in perform_request of the original class (no subclassing, just add a conditional compression). Then we won't have to deal with the error handling or anything like that as it will remain the same.
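
A minimal sketch of that design, assuming a stand-in class instead of the real `Urllib3HttpConnection`: `SketchConnection` is hypothetical, the header values are written out by hand, and `perform_request` returns what it would send instead of making a request. It uses `gzip.compress`, so it is Python 3 only.

```python
import gzip


class SketchConnection(object):
    """Stand-in for a connection class; not the real Urllib3HttpConnection."""

    def __init__(self, http_compress=False):
        self.http_compress = http_compress
        self.headers = {}
        if http_compress:
            # Set once at construction time instead of on every request.
            self.headers.update({
                'accept-encoding': 'gzip,deflate',
                'content-encoding': 'gzip',
            })

    def perform_request(self, method, url, body=None, headers=None):
        request_headers = self.headers
        if headers:
            # Per-request headers are merged into a copy of the defaults.
            request_headers = request_headers.copy()
            request_headers.update(headers)
        # Compression is independent of whether custom headers were passed,
        # and is skipped when there is no body.
        if self.http_compress and body:
            body = gzip.compress(body)
        # A real connection would send the request here.
        return method, url, request_headers, body
```

Because the compression is a single conditional inside the existing method, the error handling around the request stays the same in this sketch.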

Thanks again for raising this and sorry for the confusion on my part!

robgil · 2018-02-27T16:10:19Z

@honzakral @fxdgear re-worked and added the http_compress option. Any other thoughts on testing? That test I added is just a placeholder. I didn't notice a general "request" test that I could use to validate the headers and response compression. Any advice on unit testing would be great. I don't know whether this should be a mock or a real request against a local ES.
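
One way the mock option could look, as a sketch only: `send_compressed` is a hypothetical stand-in for the code under test, and the pool is a `unittest.mock.Mock`, so no Elasticsearch server is involved. The test inspects the arguments the mocked pool received. It relies on `unittest.mock` and `gzip.compress`, so it is Python 3 only.

```python
import gzip
import unittest
from unittest import mock


def send_compressed(pool, method, url, body):
    """Hypothetical code under test: gzip the body and hand it to the pool."""
    headers = {'content-encoding': 'gzip'}
    return pool.urlopen(method, url, gzip.compress(body), headers=headers)


class TestCompressedRequest(unittest.TestCase):
    def test_body_and_headers(self):
        pool = mock.Mock()
        send_compressed(pool, 'POST', '/_bulk', b'{"index": {}}\n')

        # call_args holds the positional and keyword arguments of the last call.
        args, kwargs = pool.urlopen.call_args
        self.assertEqual(kwargs['headers']['content-encoding'], 'gzip')
        # The third positional argument is the body; it must round-trip.
        self.assertEqual(gzip.decompress(args[2]), b'{"index": {}}\n')
```

A request against a local ES would additionally exercise the server's handling of the compressed body, which this sketch does not cover.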

fxdgear · 2018-03-04T23:11:03Z

@robgil sorry to do this to you but can you please rebase?

I wanted to get the SSL fix in first

fxdgear · 2018-03-04T23:11:52Z

Also, can you please add to the docs:

- how to use gzip
- when you'd want to use gzip vs not using gzip?

Thanks

fxdgear

Can you please rebase against master and also, can you please add to the docs:

- how to use gzip
- when you'd want to use gzip vs not using gzip?

Thanks

honzakral

Looks good, thanks for the work, Rob! I added a few nitpicky comments.

honzakral · 2018-03-13T13:45:47Z

elasticsearch/connection/http_urllib3.py

@@ -152,6 +155,10 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

            request_headers = self.headers
            if headers:
+                if self.http_compress == True:
+                    headers.update(urllib3.make_headers(accept_encoding=True))
+                    headers.update({'Content-Encoding': 'gzip'})


the headers should be modified in __init__ so it only happens once at initialization and not with every single request, see where self.headers are created. Also please use lowercase for content-encoding to be consistent.

honzakral · 2018-03-13T13:46:16Z

test_elasticsearch/test_connection.py

@@ -32,6 +32,10 @@ def test_ssl_context(self):
        )
        self.assertTrue(con.use_ssl)

+    def test_http_compression(self):
+        con = Urllib3HttpConnection(http_compress=True)
+        self.assertTrue(con.http_compress)


we should also verify that con.headers are set properly, once the headers are set in __init__.

honzakral · 2018-03-13T15:37:21Z

elasticsearch/connection/http_urllib3.py

@@ -152,6 +159,8 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

            request_headers = self.headers
            if headers:
+                if self.http_compress == True:


this needs to be outside of the if headers block, otherwise it will only be used if custom headers are specified.

it should be:

    if self.http_compress and body:
        body = gzip.compress(body)

Derp! Updating.

good eye @honzakral

robgil requested a review from honzakral January 8, 2018 17:30

pmoust approved these changes Jan 8, 2018

robgil requested a review from fxdgear January 9, 2018 15:43

hellysmile mentioned this pull request Jan 10, 2018

backport gzip features aio-libs/aioelasticsearch#72

Open

honzakral suggested changes Jan 10, 2018

robgil force-pushed the gzip branch 5 times, most recently from e0998ed to 6d92572 Compare February 27, 2018 16:02

fxdgear suggested changes Mar 7, 2018

Adding GZip support to urllib3

de58931

robgil force-pushed the gzip branch 2 times, most recently from bfd661c to 3b03d64 Compare March 13, 2018 13:30

Adding compression documentation and example

ab327fd

robgil force-pushed the gzip branch from 3b03d64 to ab327fd Compare March 13, 2018 13:40

honzakral reviewed Mar 13, 2018

fxdgear approved these changes Mar 13, 2018

Rob Gil added 2 commits March 13, 2018 11:22

Convert to lowercase for consistency

682034f

Moving header manipulation to __init__()

d1ca1ef

honzakral approved these changes Mar 13, 2018

Validating headers for compression

6986905

honzakral suggested changes Mar 13, 2018

Rob Gil added 2 commits March 13, 2018 11:39

Moving body compression out of the headers block

2adfb33

Don't compress if there is no body

438fc7d

honzakral approved these changes Mar 13, 2018

Infer true

246048d

fxdgear merged commit 64c125d into elastic:master Mar 13, 2018

This was referenced Mar 21, 2018

Bump elasticsearch from 6.0.0 to 6.2.0 BlueBrain/hpcbench#131

Closed

Bump elasticsearch from 5.4.0 to 6.2.0 lazycoderio/locators-perf#10

Closed

fxdgear mentioned this pull request Apr 16, 2018

GZip compressed bulk indexing #241

Closed

miku mentioned this pull request Apr 25, 2018

Check, if compressed content saves bandwidth or indexing time. miku/esbulk#15

Open
