Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Modification of Receive Split proposal. #4036

Merged
merged 3 commits into from
Apr 27, 2021
Merged

Proposal: Modification of Receive Split proposal. #4036

merged 3 commits into from
Apr 27, 2021

Conversation

bwplotka
Copy link
Member

@bwplotka bwplotka commented Apr 9, 2021

👋🏽 @Chans321 @brancz @GiedriusS @kakkoyun @squat @yashrsharma44 @metalmatze

Sorry for not paying attention to the past proposal, but when we revisited and tried to review #3845 and #3580 with @kakkoyun we found the potentially easier idea that might be more cleaner for users and maintainers. Feel free to give feedback!

First of all, thanks for all of your hard work on this, and experimental implementations in PR 🤗 We only want to propose different configuration really, but the idea stays exactly the same.

Some learnings for the future:

  1. Let's not merge things prematurely. The previous proposal was missing some important parts:

    • Goals were stating implementation goals, not the usage goals. We should focus on usage goals to avoid focusing too much on implementation and not seeing other options.
    • Proposal was missing Alternatives.
    • Motivation was motivating the implementation. I think it should rather motivate GOALS, not implementation yet.
    • Motivation was not clear. Statements like Right now the receiver handles ingestion, routing and writing, thus leading to too many responsibilities in a single process. might need some proof.
    • There were some left overs like -Additionally, we touch possibility for adding a consistent hashing mechanism and buffering the incoming requests. What that means?
    • The related ticket was for something else - for different database reconfiguration @squat proposed. Given that there was not enough context for this proposal.
    • Lack of migration guide for such breaking change of already stable receive.
  2. Implementation PR led to confusions.

  • It was too large, ideally we can split work into smaller pieces. It was hard to review properly, agree on all items, not miss important piece etc.
  • It was marked as almost ready, which led @yashrsharma44 to spend time on rebasing and cleaning all of it. Instead, it was missing many elements like documentation, migration docs, mixings, dashboards, alerts that should be at least well planned ahead. If we would merge any of those two in this state it could led to even bigger confusion.

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com


### Drawbacks of the project
There is no possible way to have a single-process receiver. The user must have a router + a receiver running.
Copy link
Member

@brancz brancz Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only con, then there are simple solutions to solve this like adding the http endpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is left overall, sorry, will be more explicit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clarified, could you double check?

@brancz
Copy link
Member

brancz commented Apr 9, 2021

I don't see sufficient reason here to change the direction of this project entirely when it's so close to the finish line, and two maintainers having signed off on the previous architecture. Also there is a docs PR out already.

docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
kakkoyun
kakkoyun previously approved these changes Apr 9, 2021
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. With this new proposed implementation, it draws more clear migration path and it's backwards compatible. It would much easier to roll these out.

4. The new architecture enables further performance improvements.
[@squat](https://github.com/squat):

> Currently, any change to the hashring configuration file will trigger all Thanos Receive nodes to flush their multi-TSDBs, causing them to enter an unready state until the flush is complete. This unavailability during a flush allows for a clear state transition, however it can result in downtimes on the order of five minutes for every configuration change. Moreover, during configuration changes, the hashring goes through an even longer period of partial unreadiness, where some nodes begin and finish flushing before and after others. During this partial unreadiness, the hashring can expect high internal request failure rates, which cause clients to retry their requests, resulting in even higher load. Therefore, when the hashring configuration is changed due to automatic horizontal scaling of a set of Thanos Receivers, the system can expect higher than normal resource utilization, which can create a positive feedback loop that continuously scales the hashring.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

not specifying:

```yaml
--receive.local-endpoint=RECEIVE.LOCAL-ENDPOINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

In comparison to previous proposal (as mentioned in [alternatives](#previous-proposal-separate-receive-route-command) we have big adventages:

1. We can reduce number of components in Thanos system, we can reuse similar component flags and documentation. Users has to learn about one less command and in result Thanos design is much more approachable. Less components mean less maintainance, code and other implicit duties: Separate changelogs, issue confusions, boilerplates, etc.
2. Allow consistent pattern with Query. We don't have separate StorAPI component for proxying, we have that baked into Querier. This has been proven to be flexible and understandable, so I would like to propose similar pattern in Receive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member Author

bwplotka commented Apr 9, 2021

I don't see sufficient reason here to change the direction of this project entirely when it's so close to the finish line, and two maintainers having signed off on the previous architecture. Also there is a docs PR out already.

I understand your @brancz feelings and it's entirely our bad we did not participate in this discussion, however. It's my fault. I would still love if you could take a look on 3 aspects:

  1. change the direction of this project entirely

Is it entirely changing? From user there is no much different. Indeed it changing a lot of implementation wise - it avoids tons of work which is good?

  1. when it's so close to the finish line

I would argue with this. I don't think it's close for proper usage by users. Also it's creating breaking changes which violates the stability guarantee we made with receive. Of course we are not in 1.0 but I would expect some migration plan for receive useage. I mentioned reasoning in #4036 (comment) point 2.

This proposal here allows to avoid tricky migration.

  1. , and two maintainers having signed off on the previous architecture

What's the difference if 1, two or hundred of maintainers signed off if the alternative we propose here was not considered (at least in the view of previously written proposal)?

  1. I explained the rationales in Proposal. The need of single-process receive and route you commented on is arguable and really low priority issue. The main issue is the user confusion of architectural design and each components purpose, boileplate of another command, inconsistency with Querier pattern. I would love to at least look and consider In my opinion easier solution.

@kakkoyun
Copy link
Member

kakkoyun commented Apr 9, 2021

@brancz @bwplotka I also want to see this push through the finish line as quickly as possible. If could just tweak the proposal and modify the existing implementation, it'd be faster to merge this. We can even make this happen for v0.20.0 by pushing the cut of RC for a week (we just released a major version recently anyways). I'm happy to help with the effort.

Current implementation changes the interface of the existing commands drastically, there is no easy migration path using current implementation. Maybe we should acknowledge the proposed design and implementation drift apart.

FWIW, either we can proposal or implementation, I want to have these changes on the main branch. Initial benchmarks by @metalmatze proved that there are nice gains in here.

Copy link
Contributor

@yashrsharma44 yashrsharma44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpick and few questions from my side.
Otherwise LGTM 🥇

docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
@brancz
Copy link
Member

brancz commented Apr 12, 2021

I'm not necessarily arguing with the technological direction (I do agree this is a good, maybe better direction), but I find it worrisome that we let parts of the community work on something for a long time only to step in and "know it better" and turn over a lot of work someone has put into something. This is being everything else but welcoming to new contributors, and I've noticed this behavior more than once. Part of letting people grow is allowing them to make mistakes and not stepping in and having to be part of every conversation.

Clearly, there are enough maintainers to back this proposal up from a technical perspective, so I'm not going to block this proposal, but for the record, I want to state that I find this behavior damaging for the Thanos project as a whole. Not everyone can always be part of every conversation and we need to accept that, even if it means technological decisions may be made that an individual may not necessarily agree with.

@bwplotka
Copy link
Member Author

Thanks, Frederic, this is important to all of us. As we discussed offline, there were lots of offline chats and even meeting between @kakkoyun, @Chans321 @yashrsharma44. The trigger for this proposal was actually done on pre-last Contributor Thanos Hours, which is also recorded and available here:

Topic: Thanos Contributor Office Hours
Date: Apr 1, 2021 11:00 AM Universal Time UTC

Meeting Recording:
https://zoom.us/rec/share/VoV2AVCrZYmvaPk27kSftXnTAtVEb-OQqkEiYVxfXNLFWfdSu_sNiAr7CL92888Z.seIHezcDVw2szx8f

Unfortunately, you as the mentor of @Chans321 were not updated on all of those communications, sorry for that!

It's true we should be welcoming, but it's also worth to mention we started this work in Community Bridge Q4 2020. Lot's of new ideas came after that I think are fair to consider.

Not everyone can always be part of every conversation and we need to accept that, even if it means technological decisions may be made that an individual may not necessarily agree with.

Yea, but I treat this proposal as an improvement over the past work, all reusing the previous work and foundations.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Copy link
Member Author

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, please PTAL again @yashrsharma44 @kakkoyun

Thanks for good review @yashrsharma44 !

docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
Copy link
Contributor

@yashrsharma44 yashrsharma44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥 , just small questions about wording of the statement 😁

docs/proposals/202012_receive_split.md Outdated Show resolved Hide resolved
docs/proposals/202012_receive_split.md Show resolved Hide resolved
docs/proposals/202012_receive_split.md Show resolved Hide resolved
yashrsharma44
yashrsharma44 previously approved these changes Apr 22, 2021
Copy link
Contributor

@yashrsharma44 yashrsharma44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved the issues after Office Hours, looks good to me!

Co-authored-by: Yash Sharma <yashrsharma44@gmail.com>
GiedriusS
GiedriusS previously approved these changes Apr 26, 2021
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo that one grammar nit. It indeed sucks a lot that we have some initial implementations of the previous version but at the same time I feel that it won't be too much work to update them for this improved version as we are only changing how this new mode will be turned on, right?

Co-authored-by: Giedrius Statkevičius <giedriuswork@gmail.com>
@bwplotka bwplotka merged commit 45a684e into main Apr 27, 2021
@bwplotka bwplotka deleted the rule-split branch April 27, 2021 14:15
@bwplotka
Copy link
Member Author

Thank you all for feedback and reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants