Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ProxyStore timeouts #1804

Closed
wants to merge 1 commit into from

Conversation

IKSIN
Copy link
Contributor

@IKSIN IKSIN commented Nov 27, 2019

We want to increase stability Thanos infrastructure with many stores/sidecars (49 in our case) when some stores respond slowly.

Changes

  • Fix and improve TestProxyStore_SeriesSlowStores:
    • Fixed broken test
    • Add some use-cases
    • check on elapsed time on get data from stores.
  • Change logic in ProxyStore:
    • store response timeout moved from MergedSet.Next() function to goroutine in startStreamSeriesSet to check timeouts in parallel
    • add ProxyStore metrics

More details in comments.

Verification

  • Tests
  • On production in our system with 49 store backends already 2 weeks.

Details

Old ProxyStore logic:

Image from iOS

New ProxyStore logic:

Image from iOS (1)

This PR fixes the case when 2 or more stores are responding slowly.
It is the case if, for example,
a) common S3 storage degraded -> all stores become slow
b) multiple prom instances deployed to the same host and the host is under high CPU pressure.
new messages

This PR also fixes double timeout in case of warnings

Also, we’ve separated RT metrics with/without payload:
before this change we’ve seen pretty low RT for all stores because most of queries doesn’t return any data. But if store return any payload - RT usually is much more.

This PR change works well in most of cases because if store responding slowly - it’s usually since first data chunk.

Example:
You have 3 stores backed by Ceph and 3 fast proms/sidecars
Ceph is degraded and responding very slowly (~infinite time).
Query timeout = 2m, response timeout = 10s
As it was before:
We will wait at least 10s X 3 because we’ve started response timeout timer inside MergedSet -> Next()
Now:
We will wait at least 10s, because timeout works in parallel way for all stores.

Signed-off-by: Aleksey Sin <asin@ozon.ru>
@IKSIN IKSIN mentioned this pull request Nov 27, 2019
9 tasks
@IKSIN
Copy link
Contributor Author

IKSIN commented Nov 27, 2019

This is recreated PR from #1789 (with fixed commits)

@bwplotka
Copy link
Member

hm you can rebase the previous branch (: I already commented there ):

@IKSIN IKSIN closed this Nov 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants