Adding GZip support to urllib3 #704
@robgil great thanks for the PR. I'll take a look. 👍
Can you add a test? Something simple to make sure that it's grabbing the right class?
I'm trying to make this compatible with Python 2.7, but Elastic Cloud is barfing on this with 400 errors during bulk load.
When I run the unit tests against a local ES, I don't get any errors.
I left some comments in the code, but primarily I still have 2 unanswered questions:
- why do we want this, what benefit does it provide? Do we have any numbers that prove that this can lead to (significant) performance improvements?
- if the answer to 1 is yes, do we then want this to be a separate class or just a (global) option? Is the connection class the best way to do this? It seems like just passing in headers={"content-encoding": "gzip"} and a custom serializer that will output compressed bytes would work, completely bypassing the need for this to be part of the library.
Also what is missing from the code (and primarily, tests) is logging and error handling - both of these things now expect the body to be in json (see https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/connection/base.py#L52 for example). Maybe we need to rethink that logic too, if the answer to 1 is a resounding yes.
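For reference, the headers-plus-serializer alternative raised above could look roughly like the sketch below. The class name GzipJSONSerializer and the GZIP_HEADERS constant are hypothetical, and the dumps/loads/mimetype shape is only assumed to match what the client expects from a serializer; this has not been checked against the library.

```python
import gzip
import json


class GzipJSONSerializer(object):
    """Hypothetical serializer: JSON-encode the data, then gzip the bytes."""

    # Assumed attribute; the payload is still JSON once decompressed.
    mimetype = "application/json"

    def dumps(self, data):
        # Serialize to JSON text, encode to bytes, then compress.
        return gzip.compress(json.dumps(data).encode("utf-8"))

    def loads(self, payload):
        # Inverse of dumps, included only so the round trip can be checked.
        return json.loads(gzip.decompress(payload).decode("utf-8"))


# Header that would have to accompany the compressed body (assumed shape).
GZIP_HEADERS = {"content-encoding": "gzip"}
```

Note that gzip.compress only exists on Python 3.2 and later, so this variant would not cover the Python 2.7 case mentioned earlier in the thread.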
""" | ||
def perform_request(self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None):
    headers.update(urllib3.make_headers(accept_encoding=True))
    headers.update({'Content-Encoding': 'gzip'})
there is no need to update these headers for every request, those should instead be set in self.headers in __init__ and then perform_request should just do gzip.compress.
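A minimal sketch of that suggestion, assuming a simplified stand-in class rather than the real Urllib3HttpConnection. The class name CompressedConnectionSketch is hypothetical, the header values are hard-coded assumptions (the PR itself builds the accept-encoding header with urllib3.make_headers), and perform_request only returns what would be sent instead of making a request.

```python
import gzip


class CompressedConnectionSketch(object):
    """Hypothetical stand-in for a connection class with an http_compress flag."""

    def __init__(self, http_compress=False, headers=None):
        self.http_compress = http_compress
        self.headers = dict(headers or {})
        if http_compress:
            # Set once at initialization instead of on every request.
            # Header values are assumptions for illustration.
            self.headers.update({
                "accept-encoding": "gzip,deflate",
                "content-encoding": "gzip",
            })

    def perform_request(self, method, url, body=None, headers=None):
        # Copy so that per-request headers never leak into self.headers.
        request_headers = dict(self.headers)
        if headers:
            request_headers.update(headers)
        if self.http_compress and body:
            body = gzip.compress(body)
        # Return the prepared request rather than sending it.
        return method, url, request_headers, body
```

With this layout the per-request path only merges caller-supplied headers and compresses the body; the compression headers themselves are fixed at construction time.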
elasticsearch/compat.py
Outdated
@@ -8,8 +8,13 @@
from urlparse import urlparse
from itertools import imap as map
from Queue import Queue
StringIO = BytesIO = StringIO.StringIO
I don't see StringIO used anywhere in the codebase, is there a reason those are included here?
Yes, see note above about 2.7 compatibility. I was unable to get GzipFile to work with Elastic Cloud, but it works ok with a local install.
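For context on the 2.7 compatibility point: gzip.compress was only added in Python 3.2, so a helper that runs on both 2.7 and 3.x has to go through GzipFile and an in-memory buffer. The sketch below uses io.BytesIO, which is available on both, and the helper name gzip_bytes is hypothetical. It only shows the compatible compression path and does not explain the 400 errors seen on Elastic Cloud.

```python
import gzip
import io


def gzip_bytes(data):
    """Compress bytes with gzip in a way that works on Python 2.7 and 3.x."""
    buf = io.BytesIO()
    f = gzip.GzipFile(fileobj=buf, mode="wb")
    try:
        f.write(data)
    finally:
        # Closing the GzipFile flushes the gzip trailer into the buffer.
        f.close()
    return buf.getvalue()
```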
@@ -70,6 +70,62 @@ def test_ssl_context_and_depreicated_values(self):
        self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ca_certs="/some/path/to/cert.crt")
        self.assertRaises(ImproperlyConfigured, Urllib3HttpConnection, ssl_context=ctx, ssl_version=ssl.PROTOCOL_SSLv23)


class TestGzipEnabledConnection(TestCase):
these tests are only testing logic from the Urllib3HttpConnection, which is already tested and there is no need to retest it. What we need instead are tests for the gzip specific part - gzip accept headers and the fact that body is correctly compressed.
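Tests for the gzip specific part might look something like the sketch below. FakeGzipConnection and its prepare_body method are hypothetical stand-ins so the example runs on its own; real tests would target the PR's connection class and whatever hook it exposes, and the header values are assumed.

```python
import gzip
import unittest


class FakeGzipConnection(object):
    """Hypothetical stand-in; real tests would use the PR's connection class."""

    def __init__(self):
        # Assumed header values for illustration.
        self.headers = {
            "accept-encoding": "gzip,deflate",
            "content-encoding": "gzip",
        }

    def prepare_body(self, body):
        # Compress only when there is a body to send.
        return gzip.compress(body) if body else body


class TestGzipSpecifics(unittest.TestCase):
    def test_gzip_headers_are_set(self):
        con = FakeGzipConnection()
        self.assertIn("gzip", con.headers["accept-encoding"])
        self.assertEqual("gzip", con.headers["content-encoding"])

    def test_body_is_compressed(self):
        con = FakeGzipConnection()
        body = b'{"query": {"match_all": {}}}'
        sent = con.prepare_body(body)
        # The wire payload differs from the input but decompresses back to it.
        self.assertNotEqual(body, sent)
        self.assertEqual(body, gzip.decompress(sent))
```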
While redundant, each of the tests uses the GzipEnabledConnection and tests all the same stuff, but with the gzip class.
@honzakral See first comment regarding performance. You'd need to test this over a resource constrained network to see the performance improvement. All of the tests assume you're running a local ES server.
@robgil you are right, I completely missed the number in the original PR description, my bad! In that case this makes sense. I think we'd want to have an optional flag which will set the correct headers in Thanks again for raising this and sorry for the confusion from my part!
e0998ed to 6d92572
@honzakral @fxdgear re-worked and added the |
@robgil sorry to do this to you but can you please rebase? I wanted to get the SSL fix in first |
Also, can you please add to the docs
Thanks |
Can you please rebase against master and also, can you please add to the docs:
- how to use gzip
- when you'd want to use gzip vs not using gzip
Thanks
bfd661c to 3b03d64
Looks good, thanks for the work, Rob! I added a few nitpicky comments.
@@ -152,6 +155,10 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

        request_headers = self.headers
        if headers:
            if self.http_compress == True:
                headers.update(urllib3.make_headers(accept_encoding=True))
                headers.update({'Content-Encoding': 'gzip'})
the headers should be modified in __init__ so it only happens once at initialization and not with every single request, see where self.headers are created. Also please use lowercase for content-encoding to be consistent.
@@ -32,6 +32,10 @@ def test_ssl_context(self):
        )
        self.assertTrue(con.use_ssl)

    def test_http_compression(self):
        con = Urllib3HttpConnection(http_compress=True)
        self.assertTrue(con.http_compress)
we should also verify that con.headers are set properly, once the headers are set in __init__.
@@ -152,6 +159,8 @@ def perform_request(self, method, url, params=None, body=None, timeout=None, ign

        request_headers = self.headers
        if headers:
            if self.http_compress == True:
this needs to be outside of the if headers block, otherwise it will only be used if custom headers are specified.
It should be:
if self.http_compress and body:
    body = gzip.compress(body)
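To make the difference concrete, here is a small side-by-side sketch. Both function names are hypothetical, and the nested version only assumes that the compression call sat inside the if headers block, since the diff above does not show that line.

```python
import gzip


def prepare_nested(body, http_compress, headers=None):
    # Assumed shape of the original: compression only happens when the
    # caller also passes custom headers.
    if headers:
        if http_compress:
            body = gzip.compress(body)
    return body


def prepare_fixed(body, http_compress, headers=None):
    # Suggested shape: compression depends only on the flag and the body.
    if http_compress and body:
        body = gzip.compress(body)
    return body


payload = b'{"index": {}}\n{"field": "value"}\n'

# Without custom headers the nested version sends the body uncompressed.
uncompressed = prepare_nested(payload, http_compress=True)

# The fixed version compresses regardless of custom headers and skips empty bodies.
compressed = prepare_fixed(payload, http_compress=True)
```

The extra "and body" check also avoids calling gzip.compress on None for requests that carry no body.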
Derp! Updating.
good eye @honzakral
Adds a new urllib3 class, GzipEnabledConnection, for handling gzip compression.
On bulk loads, this increased performance from 3000 docs/s to ~25k docs/s on my connection.