Patch requests #93

Closed
wants to merge 3 commits

Conversation

dylanlingelbach

@jlafon - here is my initial cut at hooking into the botocore response parsing.

Three things:

  • I would like to switch the PATCH_METHOD for the tests to be lower level, likely at the actual request level. Currently the tests don't actually exercise the botocore code, so my change isn't really getting coverage.
  • I haven't tested this yet, but it should work.
  • I haven't tested the performance of the botocore response parsing vs. this approach yet, but I'm not sure how what we have here would be any faster than what botocore has - both have to iterate through the response and convert values to binary, timestamps, etc. (a sketch of the kind of parsing hook is below)
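
For context, a minimal sketch of what "hooking into the botocore response parsing" can look like, assuming botocore's `PROTOCOL_PARSERS` registry and `JSONParser` class are available as hook points; this is illustrative only and not the actual diff in this PR:

```python
# Illustrative only -- not the code in this PR. Assumes botocore exposes
# PROTOCOL_PARSERS and JSONParser; the exact hook point in the real change
# may differ.
import json

import botocore.parsers


class RawJSONParser(botocore.parsers.JSONParser):
    """Return the decoded JSON body without shape-driven type conversion,
    leaving attribute deserialization to the caller."""

    def parse(self, response, shape):
        body = response['body']
        if isinstance(body, bytes):
            body = body.decode('utf-8')
        return json.loads(body)


# Make botocore use the replacement for the 'json' protocol (DynamoDB).
botocore.parsers.PROTOCOL_PARSERS['json'] = RawJSONParser
```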

Let me know how this looks to you

@danielhochman
Contributor

I pulled your branch and played around with it a bit. Aside from a few minor errors in the code that I had to fix, it works. It circumvents the expensive recursive parsing that the normal JSON parser does.

I also took a look at your thoughts on #72. Some comments below.

> After deploying this change we started seeing substantially higher error rates from our system, with the majority of them being network errors.

Agreed. The PR was merged before we had done significant load testing or considered alternative approaches such as this one. I also did not expect the connection to DynamoDB to be as unstable as it is.

> Additionally this change prevents upgrading to newer versions of the requests library as later versions check the type of request object sent in. botocore creates an AWSPreparedRequest object, and requests expects a PreparedRequest object only.

It does prevent us from upgrading internally within the library, but should not affect the user of the library since we use a vendored version of requests from botocore. Not sure why this is a problem though, as with either approach you are bound to the version of requests that botocore uses to make requests under the hood.

> botocore has some very handy features (like retrying after a network error) that get lost with this change. I think that is why PR #83 was created.

I strongly dislike botocore's default retry policy. It's another reason I wanted to stop using botocore to make the request. The default retry policy is terrible for any system taking user traffic. It is only configurable through some poorly documented botocore overrides using dotfiles.

The retry algorithm is `time_to_sleep = base * (growth_factor ** (attempts - 1))`. This can quickly balloon a request to 20+ seconds. Tolerable if you're doing backend processing or correcting the occasional network error, but really terrible in the case that you are doing thousands of requests on a node and an event causes all of them to enter extended backoff. This has caused an outage for us in the past. Additionally, there isn't the option for fine-grained retry control per model or operation.
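
To make the ballooning concrete, here is a quick back-of-the-envelope script using the formula above; the base and growth_factor values are illustrative, not botocore's exact DynamoDB defaults:

```python
# Worked example of the quoted backoff formula. base/growth_factor here are
# illustrative values, not botocore's exact defaults.
def time_to_sleep(base, growth_factor, attempts):
    return base * (growth_factor ** (attempts - 1))


cumulative = 0.0
for attempt in range(1, 6):
    delay = time_to_sleep(base=1.0, growth_factor=2, attempts=attempt)
    cumulative += delay
    print("attempt %d: sleep %.0fs (cumulative %.0fs)" % (attempt, delay, cumulative))
# With these values, attempt 5 alone sleeps 16s and the request has spent 31s
# backing off -- past the 20+ seconds mentioned above before the sixth try starts.
```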

For the reasons stated above, I'd like to stick to the current approach. We can add features to give it parity with the old botocore behavior in addition to making it more configurable, as well as configurable from the library using Meta or some other construct. We already see the benefit of configurability with the session class from #91, which we can use to change the size of the connection pool without having to monkey patch botocore.
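
As a rough illustration of the #91 style of configurability, a custom session class can mount a larger connection pool. The `HTTPAdapter` usage below is standard requests API; how the class gets handed to PynamoDB (e.g. a session class setting) is assumed from that PR, not prescribed here:

```python
# Sketch of configuring the connection pool via a custom session class, in the
# spirit of #91. HTTPAdapter is standard requests API; the wiring into
# PynamoDB is assumed, not an existing documented option.
import requests
from requests.adapters import HTTPAdapter


class BigPoolSession(requests.Session):
    def __init__(self):
        super(BigPoolSession, self).__init__()
        # Larger pool than the default 10, to avoid contention under load.
        adapter = HTTPAdapter(pool_connections=50, pool_maxsize=50)
        self.mount('https://', adapter)
        self.mount('http://', adapter)
```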

Willing to consider the other side of this and help implement it in either case. Thoughts? @dylanlingelbach @gtowne @joshowen @jlafon

@dylanlingelbach
Author

@danielhochman - comments below:

> Additionally this change prevents upgrading to newer versions of the requests library as later versions check the type of request object sent in. botocore creates an AWSPreparedRequest object, and requests expects a PreparedRequest object only.
>
> It does prevent us from upgrading internally within the library, but should not affect the user of the library since we use a vendored version of requests from botocore. Not sure why this is a problem though, as with either approach you are bound to the version of requests that botocore uses to make requests under the hood.

It's been a while, so my memory is a bit fuzzy here. I think I meant one of two things - either this prevented us from upgrading our version of botocore (which had a newer version of requests vendored), or this prevented us from hacking around this in a different way. In either case, I don't think this concern is that important.

> I strongly dislike botocore's default retry policy. It's another reason I wanted to stop using botocore to make the request. The default retry policy is terrible for any system taking user traffic. It is only configurable through some poorly documented botocore overrides using dotfiles.
>
> The retry algorithm is `time_to_sleep = base * (growth_factor ** (attempts - 1))`. This can quickly balloon a request to 20+ seconds. Tolerable if you're doing backend processing or correcting the occasional network error, but really terrible in the case that you are doing thousands of requests on a node and an event causes all of them to enter extended backoff. This has caused an outage for us in the past. Additionally, there isn't the option for fine-grained retry control per model or operation.

The botocore default retry policy is based on the AWS recommendation. I think your points are fair. Did you explore allowing the botocore retry logic to be replaced via a method similar to how I replaced the JSON parser? Or at the very least allowing a model to override the max retries (which would give you the current logic)? That feels like a good balance of preserving the default botocore behavior for those who want it (like us) while allowing the flexibility others need.
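
A hypothetical sketch of the per-model override being suggested; the `max_retries` Meta attribute is invented here for illustration and is not an existing PynamoDB setting:

```python
# Hypothetical illustration of a per-model retry override; max_retries on
# Meta is invented for this sketch, not an existing PynamoDB option.
from pynamodb.attributes import UnicodeAttribute
from pynamodb.models import Model


class UserModel(Model):
    class Meta:
        table_name = 'users'
        max_retries = 0  # hypothetical: opt this table out of extended backoff

    user_id = UnicodeAttribute(hash_key=True)
```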

> For the reasons stated above, I'd like to stick to the current approach. We can add features to give it parity with the old botocore behavior in addition to making it more configurable, as well as configurable from the library using Meta or some other construct. We already see the benefit of configurability with the session class from #91, which we can use to change the size of the connection pool without having to monkey patch botocore.

Wouldn't it be less effort to have botocore support the hooks needed to configure retries, connection pool size, etc? It is very hard for us to get the behavior we want (default botocore) with the current approach.

To be fair, I haven't looked at this since December so my thinking could be out of date.

Hope this helps explain my thinking

@danielhochman
Contributor

> The botocore default retry policy is based on the AWS recommendation.

Right, we have to remember the goals of the AWS recommendations and botocore. Botocore was built for compatibility, not performance, which is why its parser is so slow. The AWS recommendations are meant to apply to all AWS APIs, most of which deal with resource orchestration in EC2, not live user traffic. They make little sense for an online database.

> I think your points are fair. Did you explore allowing the botocore retry logic to be replaced via a method similar to how I replaced the JSON parser? Or at the very least allowing a model to override the max retries (which would give you the current logic)? That feels like a good balance of preserving the default botocore behavior for those who want it (like us) while allowing the flexibility others need.

I am going to look into that. It wouldn't give us the granularity we might want per-operation, but it would give us per-table control. What makes me uncomfortable, though, is that these overrides are so poorly documented. In the alternative approach we're in a position to replace botocore altogether with a request signing library. If we go with this PR's approach we are coupling tightly to botocore without using its primary offering, which is an interface to the API.

> Wouldn't it be less effort to have botocore support the hooks needed to configure retries, connection pool size, etc? It is very hard for us to get the behavior we want (default botocore) with the current approach.

I really don't think we're that far away from parity once we add retry. It's not that complex, though it's obvious that it's poorly understood given how disruptive my original changes were.

Still could go either way on this. Looking forward to thoughts from @jlafon on which direction he wants to go in.

@dylanlingelbach
Author

I didn't realize performance was valued over compatibility for this library. I had assumed this library was about providing a simpler, Pythonic interface over DynamoDB - not that it was a high-performance DynamoDB library. If performance is valued over compatibility, I don't think what I have here is the right direction. If this is a high-performance library, I assume you'll want to go the route of pulling out botocore entirely and writing directly against the API to optimize.

Let me know what you think.

@danielhochman
Contributor

@dylanlingelbach I don't think either approach meaningfully affects compatibility. Pynamo formulates the entire request body and implements the DynamoDB API either way.

The tradeoff here is the amount of code that we have to maintain for retry in exchange for dramatically better performance. I don't think it's too much code to maintain. I would consider sticking with botocore and modifying its parsing/retry behavior if it were well documented and easy to configure local to our library. After some discussion with @jlafon, I have decided to move us away from botocore. #121 implements retry for RequestExceptions, which should eliminate 99% of the pain that users of the library are having with the current approach. I am going to take a more measured approach to implementing retry for ProvisionedThroughputExceptions and the like rather than blanketing the same retry logic on top of them. It should come within the next month, though.
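
For reference, the shape of a RequestException retry is roughly the following; this is a sketch under assumptions, not the actual #121 code, and the function name and limits are illustrative:

```python
# Rough sketch of retrying on transient requests.RequestException with a
# short, bounded backoff. Not the code in #121; names and limits are
# illustrative only.
import time

import requests


def send_with_retry(session, prepared_request, max_attempts=3, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return session.send(prepared_request)
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            # Short, linear backoff keeps worst-case latency bounded, unlike
            # an unbounded exponential policy.
            time.sleep(base_delay * attempt)
```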
