-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Modification of Receive Split proposal. #4036
Conversation
|
||
### Drawbacks of the project | ||
There is no possible way to have a single-process receiver. The user must have a router + a receiver running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the only con, then there are simple solutions to solve this like adding the http endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is left overall, sorry, will be more explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I clarified, could you double check?
I don't see sufficient reason here to change the direction of this project entirely when it's so close to the finish line, and two maintainers having signed off on the previous architecture. Also there is a docs PR out already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. With this new proposed implementation, it draws more clear migration path and it's backwards compatible. It would much easier to roll these out.
4. The new architecture enables further performance improvements. | ||
[@squat](https://github.com/squat): | ||
|
||
> Currently, any change to the hashring configuration file will trigger all Thanos Receive nodes to flush their multi-TSDBs, causing them to enter an unready state until the flush is complete. This unavailability during a flush allows for a clear state transition, however it can result in downtimes on the order of five minutes for every configuration change. Moreover, during configuration changes, the hashring goes through an even longer period of partial unreadiness, where some nodes begin and finish flushing before and after others. During this partial unreadiness, the hashring can expect high internal request failure rates, which cause clients to retry their requests, resulting in even higher load. Therefore, when the hashring configuration is changed due to automatic horizontal scaling of a set of Thanos Receivers, the system can expect higher than normal resource utilization, which can create a positive feedback loop that continuously scales the hashring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
not specifying: | ||
|
||
```yaml | ||
--receive.local-endpoint=RECEIVE.LOCAL-ENDPOINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
In comparison to previous proposal (as mentioned in [alternatives](#previous-proposal-separate-receive-route-command) we have big adventages: | ||
|
||
1. We can reduce number of components in Thanos system, we can reuse similar component flags and documentation. Users has to learn about one less command and in result Thanos design is much more approachable. Less components mean less maintainance, code and other implicit duties: Separate changelogs, issue confusions, boilerplates, etc. | ||
2. Allow consistent pattern with Query. We don't have separate StorAPI component for proxying, we have that baked into Querier. This has been proven to be flexible and understandable, so I would like to propose similar pattern in Receive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I understand your @brancz feelings and it's entirely our bad we did not participate in this discussion, however. It's my fault. I would still love if you could take a look on 3 aspects:
Is it entirely changing? From user there is no much different. Indeed it changing a lot of implementation wise - it avoids tons of work which is good?
I would argue with this. I don't think it's close for proper usage by users. Also it's creating breaking changes which violates the stability guarantee we made with receive. Of course we are not in 1.0 but I would expect some migration plan for receive useage. I mentioned reasoning in #4036 (comment) point 2. This proposal here allows to avoid tricky migration.
What's the difference if 1, two or hundred of maintainers signed off if the alternative we propose here was not considered (at least in the view of previously written proposal)?
|
@brancz @bwplotka I also want to see this push through the finish line as quickly as possible. If could just tweak the proposal and modify the existing implementation, it'd be faster to merge this. We can even make this happen for v0.20.0 by pushing the cut of RC for a week (we just released a major version recently anyways). I'm happy to help with the effort. Current implementation changes the interface of the existing commands drastically, there is no easy migration path using current implementation. Maybe we should acknowledge the proposed design and implementation drift apart. FWIW, either we can proposal or implementation, I want to have these changes on the main branch. Initial benchmarks by @metalmatze proved that there are nice gains in here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpick and few questions from my side.
Otherwise LGTM 🥇
I'm not necessarily arguing with the technological direction (I do agree this is a good, maybe better direction), but I find it worrisome that we let parts of the community work on something for a long time only to step in and "know it better" and turn over a lot of work someone has put into something. This is being everything else but welcoming to new contributors, and I've noticed this behavior more than once. Part of letting people grow is allowing them to make mistakes and not stepping in and having to be part of every conversation. Clearly, there are enough maintainers to back this proposal up from a technical perspective, so I'm not going to block this proposal, but for the record, I want to state that I find this behavior damaging for the Thanos project as a whole. Not everyone can always be part of every conversation and we need to accept that, even if it means technological decisions may be made that an individual may not necessarily agree with. |
Thanks, Frederic, this is important to all of us. As we discussed offline, there were lots of offline chats and even meeting between @kakkoyun, @Chans321 @yashrsharma44. The trigger for this proposal was actually done on pre-last Contributor Thanos Hours, which is also recorded and available here:
Unfortunately, you as the mentor of @Chans321 were not updated on all of those communications, sorry for that! It's true we should be welcoming, but it's also worth to mention we started this work in Community Bridge Q4 2020. Lot's of new ideas came after that I think are fair to consider.
Yea, but I treat this proposal as an improvement over the past work, all reusing the previous work and foundations. |
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, please PTAL again @yashrsharma44 @kakkoyun
Thanks for good review @yashrsharma44 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🔥 , just small questions about wording of the statement 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved the issues after Office Hours, looks good to me!
Co-authored-by: Yash Sharma <yashrsharma44@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo that one grammar nit. It indeed sucks a lot that we have some initial implementations of the previous version but at the same time I feel that it won't be too much work to update them for this improved version as we are only changing how this new mode will be turned on, right?
Co-authored-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Thank you all for feedback and reviews! |
👋🏽 @Chans321 @brancz @GiedriusS @kakkoyun @squat @yashrsharma44 @metalmatze
Sorry for not paying attention to the past proposal, but when we revisited and tried to review #3845 and #3580 with @kakkoyun we found the potentially easier idea that might be more cleaner for users and maintainers. Feel free to give feedback!
First of all, thanks for all of your hard work on this, and experimental implementations in PR 🤗 We only want to propose different configuration really, but the idea stays exactly the same.
Some learnings for the future:
Let's not merge things prematurely. The previous proposal was missing some important parts:
Right now the receiver handles ingestion, routing and writing, thus leading to too many responsibilities in a single process.
might need some proof.-Additionally, we touch possibility for adding a consistent hashing mechanism and buffering the incoming requests.
What that means?Implementation PR led to confusions.
Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com