-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix task cancellation authz on fulfilling cluster #109357
Fix task cancellation authz on fulfilling cluster #109357
Conversation
…asks/ban] is not an index or cluster action
Pinging @elastic/es-security (Team:Security) |
Hi @albertzaharovits, I've created a changelog YAML for you. |
@n1v0lg The root cause here is that Lines 315 to 317 in 6955427
Lines 535 to 539 in be502cc
But internal: actions from a remote cluster are not perceived as originating from the local system user (it is the remote system user), in which case they appear as internal: actions from a non-system user, which is rejected.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 We also discussed on Slack:
- We should backport this to 8.14.1
- This "breaks" in a mixed cluster setting (e.g., in an 8.13.x and 8.15.0 cluster). However, if the failure mode is simply a new exception (instead of the current authz failed one) it's not worth addressing this, esp. since RCS 2.0 is in beta before 8.13
- An assertion/log for when a "remote" system user fails authz on an internal action would to nice (either in this PR or a follow-up) to make this easier to detect in the future.
Happy to re-review if point 2 above ends up requiring some additional changes, but as it stands this looks ready to me. Thanks for tracking this down!
String asyncSearchId = (String) submitAsyncSearchResponseMap.get("id"); | ||
assertThat(asyncSearchId, notNullValue()); | ||
// wait for the tasks to show up on the querying cluster | ||
assertTrue(waitUntil(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: based on Javadoc and this issue, assertBusy
is apparently more canonical.
I'm slightly in favor of following that convention (and using assertBusy
with an inner assertTrue
) but I'm not pushy here. To be controversial, I think assertTrue(waitUntil(...))
is actually more readable... Still, I think since there is a convention and it's easy to follow, it's probably the right move to follow it (i.e., use assertBusy
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for the pointer! Pushed bb508f5
{ | ||
"name": "*:*", | ||
"error_type": "exception", | ||
"stall_time_seconds": 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how slow everything is I wonder if we want to start out more generous and go for a whole minute (or even longer). This is a pretty arbitrary suggestion and I don't have concrete data to back it, but I have a hunch that 30s will be exceeded in a slow CI run shortly after we merge this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, pushed 03807fb
As discussed, I did a manual check for this. elasticsearch/server/src/main/java/org/elasticsearch/tasks/TaskCancellationService.java Line 200 in 17c6230
Line 536 in 17c6230
elasticsearch/server/src/main/java/org/elasticsearch/transport/InboundAggregator.java Line 47 in 17c6230
That means that, given the the fix as it currently stands, it won't make it a worse experience when communicating with clusters that don't have the fix (e.g. 8.14.0). Just the log message will be slightly different. |
I pushed 1f700ab such that the log message on the fulfilling cluster includes the authentication. From that, a keen eye can spot if it's a remote system user or not. For example, this is a sample log error:
It's verbose AF, but I think this should aid debugging and covers the case of a remote system user getting rejected. WDYT? |
@albertzaharovits sounds good! |
This fixes task cancellation actions (i.e. internal:admin/tasks/cancel_child and internal:admin/tasks/ban) not being authorized by the fulfilling cluster. This can result in orphaned tasks on the fulfilling cluster.
This fixes task cancellation actions (i.e.
internal:admin/tasks/cancel_child
and
internal:admin/tasks/ban
) not being authorized by the fulfilling cluster.This can result in orphaned tasks on the fulfilling cluster.