Fix task cancellation authz on fulfilling cluster #109357

Merged

Conversation

albertzaharovits
Contributor

This fixes task cancellation actions (i.e. internal:admin/tasks/cancel_child
and internal:admin/tasks/ban) not being authorized by the fulfilling cluster.
This can result in orphaned tasks on the fulfilling cluster.

@albertzaharovits albertzaharovits added >bug :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Jun 4, 2024
@albertzaharovits albertzaharovits self-assigned this Jun 4, 2024
@albertzaharovits albertzaharovits marked this pull request as ready for review June 4, 2024 16:36
@albertzaharovits albertzaharovits requested a review from a team as a code owner June 4, 2024 16:36
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Jun 4, 2024
@elasticsearchmachine
Collaborator

Hi @albertzaharovits, I've created a changelog YAML for you.

@albertzaharovits
Contributor Author

@n1v0lg The root cause here is that internal: actions should be authorized by the local system internal user in:

    if (SystemUser.is(authentication.getEffectiveSubject().getUser())) {
        // this never goes async so no need to wrap the listener
        authorizeSystemUser(authentication, action, auditId, unwrappedRequest, listener);

because they are otherwise rejected in:

    } else {
        logger.warn("denying access as action [{}] is not an index or cluster action", action);
        auditTrail.accessDenied(requestId, authentication, action, request, authzInfo);
        listener.onFailure(actionDenied(authentication, authzInfo, action, request));
    }

But internal: actions from a remote cluster are not perceived as originating from the local system user (the sender is the remote cluster's system user), so they appear as internal: actions from a non-system user and are rejected.
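
To make the failure mode concrete, here is a minimal sketch of the decision described above. It is not the Elasticsearch implementation: the class `InternalActionAuthzSketch`, the `Decision` enum, and the prefix checks for `cluster:` and `indices:` are simplifying assumptions made for illustration only.

```java
// Hypothetical, simplified model of the authorization branch discussed above.
// None of these names exist in Elasticsearch; they only illustrate the logic.
final class InternalActionAuthzSketch {

    enum Decision { AUTHORIZED_AS_SYSTEM, ROLE_BASED_CHECK, DENIED }

    /**
     * @param effectiveSubjectIsLocalSystemUser whether the effective subject is the
     *        local cluster's system user (false for a remote cluster's system user)
     * @param action the transport action name
     */
    static Decision decide(boolean effectiveSubjectIsLocalSystemUser, String action) {
        if (effectiveSubjectIsLocalSystemUser) {
            // Only this branch can authorize internal: actions.
            return Decision.AUTHORIZED_AS_SYSTEM;
        }
        // Assumption: index and cluster actions are recognised by prefix.
        if (action.startsWith("cluster:") || action.startsWith("indices:")) {
            return Decision.ROLE_BASED_CHECK;
        }
        // "denying access as action [...] is not an index or cluster action"
        return Decision.DENIED;
    }

    public static void main(String[] args) {
        String ban = "internal:admin/tasks/ban";
        System.out.println("local system user:  " + decide(true, ban));
        System.out.println("remote system user: " + decide(false, ban));
    }
}
```

In this model the remote system user falls through to the last branch for both internal:admin/tasks/ban and internal:admin/tasks/cancel_child, which is the rejection that leaves orphaned tasks on the fulfilling cluster.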

Contributor

@n1v0lg n1v0lg left a comment


LGTM 👍 We also discussed on Slack:

  1. We should backport this to 8.14.1
  2. This "breaks" in a mixed cluster setting (e.g., in an 8.13.x and 8.15.0 cluster). However, if the failure mode is simply a new exception (instead of the current authz failed one) it's not worth addressing this, esp. since RCS 2.0 is in beta before 8.13
  3. An assertion/log for when a "remote" system user fails authz on an internal action would be nice (either in this PR or a follow-up) to make this easier to detect in the future.

Happy to re-review if point 2 above ends up requiring some additional changes, but as it stands this looks ready to me. Thanks for tracking this down!

String asyncSearchId = (String) submitAsyncSearchResponseMap.get("id");
assertThat(asyncSearchId, notNullValue());
// wait for the tasks to show up on the querying cluster
assertTrue(waitUntil(() -> {
Contributor


Nit: based on Javadoc and this issue, assertBusy is apparently more canonical.

I'm slightly in favor of following that convention (and using assertBusy with an inner assertTrue) but I'm not pushy here. To be controversial, I think assertTrue(waitUntil(...)) is actually more readable... Still, I think since there is a convention and it's easy to follow, it's probably the right move to follow it (i.e., use assertBusy).
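
For readers unfamiliar with the two styles being compared, the following is a rough sketch of how they differ. The helpers below are hypothetical stand-ins written for this example; they are not the Elasticsearch test framework's `waitUntil` or `assertBusy`, whose signatures and back-off behaviour may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical, simplified polling helpers (not the Elasticsearch test framework).
final class PollingSketch {

    /** Polls a boolean condition; reports a timeout by returning false. */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline || !pause()) {
                return false;
            }
        }
        return true;
    }

    /** Re-runs assertions until they pass; on timeout rethrows the last AssertionError. */
    static void assertBusy(Runnable assertions, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                assertions.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline || !pause()) {
                    throw e; // the failure message of the last attempt is preserved
                }
            }
        }
    }

    /** Sleeps briefly between attempts; returns false if the thread was interrupted. */
    private static boolean pause() {
        try {
            Thread.sleep(10);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        assertBusy(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new AssertionError("tasks not visible yet");
            }
        }, 1_000);
        System.out.println("assertions passed after " + attempts.get() + " attempts");
    }
}
```

One practical difference in this sketch: when the wait times out, the `assertBusy` style surfaces the inner assertion's own message, whereas `assertTrue(waitUntil(...))` can only report that a boolean was false.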

Contributor Author


Nice, thanks for the pointer! Pushed bb508f5

{
"name": "*:*",
"error_type": "exception",
"stall_time_seconds": 30
Contributor


Given how slow everything is I wonder if we want to start out more generous and go for a whole minute (or even longer). This is a pretty arbitrary suggestion and I don't have concrete data to back it, but I have a hunch that 30s will be exceeded in a slow CI run shortly after we merge this...

Contributor Author


Agreed, pushed 03807fb

@albertzaharovits
Contributor Author

@n1v0lg

This "breaks" in a mixed cluster setting (e.g., in an 8.13.x and 8.15.0 cluster). However, if the failure mode is simply a new exception (instead of the current authz failed one) it's not worth addressing this, esp. since RCS 2.0 is in beta before 8.13

As discussed, I did a manual check for this.
On the querying cluster, if the cancellation action is either missing or unauthorized, the log messages are almost identical:

    final Throwable cause = ExceptionsHelper.unwrapCause(exp);

On the fulfilling cluster, if the action is unauthorized we see a log entry:

    logger.warn("denying access as action [{}] is not an index or cluster action", action);

but if the action is missing there's no corresponding log entry:

    throw new ActionNotFoundTransportException(actionName);

That means that the fix as it currently stands won't make for a worse experience when communicating with clusters that don't have the fix (e.g. 8.14.0). Only the log message will be slightly different.

@albertzaharovits
Contributor Author

@n1v0lg

An assertion/log for when a "remote" system user fails authz on an internal action would be nice (either in this PR or a follow-up) to make this easier to detect in the future.

I pushed 1f700ab such that the log message on the fulfilling cluster includes the authentication. From that, a keen eye can spot if it's a remote system user or not. For example, this is a sample log error:

denying access for [Authentication[effectiveSubject=Subject{version=8676000, user=User[username=test_user,roles=[],fullName=null,email=null,metadata={}], realm={Realm[_es_cross_cluster_access._es_cross_cluster_access] on Node[fulfilling-cluster-0]}, type=CROSS_CLUSTER_ACCESS, metadata={_security_api_key_creator_realm_name=default_file, _security_api_key_limited_by_role_descriptors=org.elasticsearch.common.bytes.BytesArray@1323, _security_api_key_id=0wYw6Y8BRt8c6GrAors2, _security_api_key_type=cross_cluster, _security_cross_cluster_access_authentication=Authentication[effectiveSubject=Subject{version=8676000, user=User[username=_system,roles=[],fullName=null,email=null,metadata={}], realm={Realm[__attach.__attach] on Node[query-cluster-0]}, type=USER, metadata={}},type=INTERNAL], _security_api_key_creator_realm_type=file, _security_api_key_name=cross_cluster_access_key, _security_api_key_role_descriptors=org.elasticsearch.common.bytes.BytesArray@8def201, _security_cross_cluster_access_role_descriptors=[]}},type=API_KEY]] as action [internal:admin/tasks/ban] is not an index or cluster action

It's verbose AF, but _security_cross_cluster_access_authentication=Authentication[effectiveSubject=Subject{version=8676000, user=User[username=_system tells one that this is a remote system user.
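
If spotting that fragment by eye gets tedious, a small helper along these lines could scan log lines for it. This is only a sketch: the class `RemoteSystemUserLogSketch` and its method are hypothetical, and the pattern assumes the log format stays as in the sample above, which is not guaranteed.

```java
import java.util.regex.Pattern;

// Hypothetical helper; relies on the exact toString() layout seen in the sample log line.
final class RemoteSystemUserLogSketch {

    // Matches the nested (querying-cluster) authentication whose user is _system.
    private static final Pattern REMOTE_SYSTEM_USER = Pattern.compile(
        "_security_cross_cluster_access_authentication=Authentication\\[effectiveSubject="
            + "Subject\\{version=\\d+, user=User\\[username=_system[,\\]]"
    );

    static boolean mentionsRemoteSystemUser(String logMessage) {
        return REMOTE_SYSTEM_USER.matcher(logMessage).find();
    }

    public static void main(String[] args) {
        // Abbreviated from the sample log entry above.
        String sample = "denying access for [Authentication[effectiveSubject=Subject{version=8676000, "
            + "user=User[username=test_user,roles=[]], type=CROSS_CLUSTER_ACCESS, metadata={"
            + "_security_cross_cluster_access_authentication=Authentication[effectiveSubject="
            + "Subject{version=8676000, user=User[username=_system,roles=[]]}}]] "
            + "as action [internal:admin/tasks/ban] is not an index or cluster action";
        System.out.println(mentionsRemoteSystemUser(sample));
    }
}
```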

I think this should aid debugging and covers the case of a remote system user getting rejected. WDYT?

@n1v0lg
Contributor

n1v0lg commented Jun 6, 2024

I think this should aid debugging and covers the case of a remote system user getting rejected. WDYT?

@albertzaharovits sounds good!

@albertzaharovits albertzaharovits merged commit ed0febb into elastic:main Jun 6, 2024
20 checks passed
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Jun 6, 2024
albertzaharovits added a commit that referenced this pull request Jun 6, 2024
…109422)


Backport of #109357
@albertzaharovits albertzaharovits deleted the test-for-cancelling-tasks branch June 6, 2024 14:59
Labels
>bug :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Security Meta label for security team v8.14.1 v8.15.0
3 participants