
Fixes some bugs in the limit connection feature #2964

Merged
merged 12 commits from fix_buggy_connection_limit into master on May 4, 2018

Conversation

pfreixes
Contributor

@pfreixes pfreixes commented Apr 28, 2018

There are two places that both try to release the next waiter: when the ongoing connection is cancelled and we explicitly wake up a waiter, and when the connection is released, either explicitly or implicitly due to the close of a response.

In both situations we call the same method `_release_waiter`, which iterates over all of the queues until it finds a waiter that is not already cancelled.

This iteration also checks that the limit and the per-host limit are not exceeded. The logic that checks how many available connections/free slots there are has been placed behind the new `_available_connections` function.

So far this PR fixes:

  • Take into account the limit and the per-host limit any time we try
    to wake up a waiter. No connection available, no wake-up.
  • Don't give up until reaching a waiter that is not finished.
  • Always iterate through all of the different hosts.


What is missing:

- More coverage
- Shuffle the keys to always give the same chances to all of the hosts?
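A minimal sketch of the release logic described above. The names `_release_waiter` and `_available_connections` come from this PR; the surrounding connector state (`Waiter`, the `_acquired` sets, the constructor arguments) is simplified here and is not the actual aiohttp code:

```python
import random
from collections import defaultdict, deque


class Waiter:
    """Stand-in for an asyncio.Future: only done()/cancel()/set_result()."""

    def __init__(self):
        self._done = False

    def done(self):
        return self._done

    def cancel(self):
        self._done = True

    def set_result(self, result):
        self._done = True


class Connector:
    """Simplified sketch of the connector's waiter bookkeeping."""

    def __init__(self, limit=10, limit_per_host=0):
        self._limit = limit
        self._limit_per_host = limit_per_host
        self._acquired = set()
        self._acquired_per_host = defaultdict(set)
        self._waiters = defaultdict(deque)  # one FIFO queue per host key

    def _available_connections(self, key):
        # Global limit: free slots relative to all acquired connections.
        if self._limit:
            available = self._limit - len(self._acquired)
            if available <= 0:
                return available
        # The per-host limit can further restrict the answer.
        if self._limit_per_host and key in self._acquired_per_host:
            return self._limit_per_host - len(self._acquired_per_host[key])
        return 1  # at least one slot is free

    def _release_waiter(self):
        if not self._waiters:
            return
        # Shuffle the keys so each call iterates hosts in a different
        # order, giving every host the same chance to be woken up.
        queues = list(self._waiters.keys())
        random.shuffle(queues)
        for key in queues:
            if self._available_connections(key) < 1:
                continue  # no free slot for this host: don't wake anyone
            waiters = self._waiters[key]
            # Don't give up until reaching a waiter that is not finished.
            while waiters:
                waiter = waiters.popleft()
                if not waiter.done():
                    waiter.set_result(None)
                    return
```

Note how the limit check happens before any waiter is popped, so "no connection available, no wake-up" holds for every host queue.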
@pfreixes
Contributor Author

Regarding the concern raised in #1821, no degradation is perceived with this PR:

limit_semaphore:
  Responses: Counter({200: 9999})
  Took:      20.797451972961426 seconds
limit_builtin:
  Responses: Counter({200: 9999})
  Took:      20.224246501922607 seconds

@pfreixes
Contributor Author

It would also be worth it if @fafhrd91 could take a look.

@codecov-io

codecov-io commented Apr 28, 2018

Codecov Report

Merging #2964 into master will increase coverage by 0.17%.
The diff coverage is 96.66%.


@@            Coverage Diff             @@
##           master    #2964      +/-   ##
==========================================
+ Coverage   97.98%   98.16%   +0.17%     
==========================================
  Files          40       41       +1     
  Lines        7592     7842     +250     
  Branches     1323     1392      +69     
==========================================
+ Hits         7439     7698     +259     
+ Misses         51       47       -4     
+ Partials      102       97       -5
Impacted Files Coverage Δ
aiohttp/connector.py 98.16% <96.66%> (+1.29%) ⬆️
aiohttp/__init__.py 100% <0%> (ø)
aiohttp/web_app.py 99.16% <0%> (+0.06%) ⬆️
aiohttp/web_urldispatcher.py 99.6% <0%> (+0.5%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ba7bf1...ff39bb8. Read the comment docs.

@asvetlov
Member

I'm at a conference, sorry.
Will try to find a time for review tomorrow.

@thehesiod
Contributor

will try to carve out some time too later today, was working on a bunch of aiobotocore PRs that just landed, am hoping to switch to that and this fix in prod for the perf improvement.

@thehesiod
Contributor

btw I was thinking, should a "FIFO semaphore" abstraction be refactored from this? I think it would make the main code clearer.

@asvetlov
Member

asvetlov commented May 2, 2018

@thehesiod What "FIFO semaphore" abstraction is?

# remove a waiter even if it was cancelled, normally it's
# removed when it's notified
try:
    waiters.remove(fut)
except ValueError:  # fut may no longer be in list
    pass

if not waiters:
    if key in self._waiters and not self._waiters[key]:
Member

What is the case when waiters is not self._waiters[key]?

Contributor Author

nothing, fixed!

# Shuffle the dict keys so that the queues are not
# iterated in the same order at each call.
queues = list(self._waiters.keys())
random.shuffle(queues)
Member

Why do we need shuffling?
The operation is not free.

Contributor Author

Yes, that bothers me too. But without the shuffle, the queues would always be iterated in the same order, with a chance of starvation for certain queues.

We could try to get rid of that operation, but it would imply some significant changes.

Contributor Author

@pfreixes pfreixes May 2, 2018

The idea of @thehesiod of having a "true" FIFO would get rid of this operation, but my gut feeling says that this FIFO queue would have enough overhead, since it has to take care of the per-host limit, to make it slower than just calling shuffle.

PS: a shuffle operation takes around 1 microsecond for a list of 10 elements.

Member

`waiters` is a deque.
`deque.rotate()` is cheap. Can we use it?

Contributor Author

Remember that `_waiters` is the dictionary that holds the queues. So, unfortunately, we can't use it directly.

We might implement the FIFO on top of a single deque, using the rotate method as a way to skip a waiter that can't be released because of the per-host limit restriction. I didn't know this, but peeking at the head element is O(1), so the check and the later skip can be done without losing the deque properties.
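A rough sketch of that alternative, for illustration only (this is a hypothetical design, not what this PR implements): a single deque of `(key, waiter)` pairs, where blocked hosts are rotated to the back instead of being scanned per-host.

```python
from collections import deque


def release_one(waiters, available):
    """Wake the first waiter whose host has a free slot.

    waiters:   deque of (key, waiter) pairs in FIFO order
    available: callable(key) -> number of free slots for that host

    Waiters whose host is at its limit are rotated to the tail so the
    deque keeps its order without an O(n) copy or shuffle.
    """
    for _ in range(len(waiters)):
        if not waiters:
            break
        key, waiter = waiters[0]  # peeking at the head is O(1)
        if available(key) < 1:
            waiters.rotate(-1)    # skip: move the head to the tail
            continue
        waiters.popleft()
        if not waiter.done():
            waiter.set_result(None)
            return True
    return False
```

One caveat of this sketch is that rotating blocked entries to the tail changes their relative position, so strict FIFO fairness across hosts would need more bookkeeping, which is the footprint concern mentioned above.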

Member

Got it. The PR's code looks OK to me.

waiter = waiters.popleft()
if not waiter.done():
    waiter.set_result(None)
    found = True
Member

Isn't `break` enough? It saves a check.

Contributor Author

The `break` does not work here. Anyway, I changed it to avoid the `found` check by duplicating some code.
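As a side note on why a bare `break` cannot stop the search: it only exits the innermost loop, while the outer loop over host queues keeps running. A tiny self-contained illustration:

```python
# `break` exits only the innermost loop; the outer for-loop keeps going.
results = []
for key in ["a", "b"]:
    for item in [1, 2]:
        results.append((key, item))
        break  # stops the inner loop only
# results == [("a", 1), ("b", 1)]  -- both outer iterations still ran
```

This is why the original code needed a `found` flag, and why an early `return` (or duplicated exit code) is the usual way to leave both loops at once.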

@thehesiod
Contributor

@asvetlov basically what this code is trying to do is act as a semaphore on the connectors, with the caveat that the waiters are released in order. That's why I said a FIFO semaphore :)

@asvetlov
Member

asvetlov commented May 2, 2018

@thehesiod got you


return

if not waiters:
Member

The condition is always true.

Member

I mean it would be nice to have a test case where waiters become non-empty after releasing.

Contributor Author

Yes, it doesn't make sense right now. Let me get rid of this part of the code, and I will check the tests.

# remove a waiter even if it was cancelled, normally it's
# removed when it's notified
try:
    waiters.remove(fut)
except ValueError:  # fut may no longer be in list
    pass

raise e
finally:
Contributor Author

@asvetlov can you check this new change? My concern with the previous implementation was that having a `del self._waiters[key]` in two places was a ticking time bomb and hard to maintain.
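For illustration, the maintenance concern is about having one single cleanup site instead of two. A hypothetical minimal sketch (the helper name is made up; the real change lives inside the connector's `finally` block):

```python
from collections import deque


def drop_empty_queue(waiters, key):
    # A single, central place that removes an exhausted per-host
    # queue, instead of scattering `del waiters[key]` around.
    if key in waiters and not waiters[key]:
        del waiters[key]


waiters = {"host": deque()}
drop_empty_queue(waiters, "host")  # the empty queue is removed
```

Calling one helper from the `finally` path means every exit route, normal or exceptional, runs the same cleanup, which is the point of the change.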

Member

Looks good

@pfreixes
Contributor Author

pfreixes commented May 4, 2018

@asvetlov any clue about how I can increase the coverage?

@asvetlov
Member

asvetlov commented May 4, 2018

I think everything is perfect now.

@asvetlov asvetlov merged commit 9116128 into master May 4, 2018
@asvetlov asvetlov deleted the fix_buggy_connection_limit branch May 4, 2018 06:59
@asvetlov
Member

asvetlov commented May 4, 2018

thank you, @pfreixes

@pfreixes
Contributor Author

pfreixes commented May 4, 2018

Thanks to you! Any chance of getting a new bugfix release?

@thehesiod
Contributor

Sorry I didn't have a chance to help review; I've been on a killer time-sink project.

@asvetlov
Member

asvetlov commented May 4, 2018

@thehesiod no problem.
If you see something -- we can fix it in a separate PR.

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel there are important points made in this discussion, please include those excerpts in that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 28, 2019