
feat(consolerole): Add support for namespaced token used for interactive console access to containers #148

Open
wants to merge 4 commits into base: master

Conversation

giannoul

This PR adds the Service Account token needed for deis ps:console (see the feat(pkg/console): support interactive console PR on workflow-cli).

The feature consists of the current PR and its feat(pkg/console): support interactive console counterpart on workflow-cli. The idea is to mimic the mechanism that kubectl uses, but via the k8s API directly. Specifically, we can get websocket access to a console in a pod, but we need a token. In order to be able to use the token we need to:

  • create a clusterrole named deis:deis-console that will actually have the pods/exec permission
  • add the needed permissions to deis:deis-controller in order to be able to attach the deis:deis-console clusterrole to a Service Account that belongs to a single namespace
  • create a Service Account and a token (secret) when we create a hephy application
  • instruct hephy controller to send that token when asked (this means that the user has access to the requested application)

The above ensures that:

  1. the token can be obtained only from the users that can access an application
  2. no global token is used
  3. the token will be cleared once the application is destroyed (since it belongs to the namespace of the application)
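For illustration, the per-application setup described above could look roughly like the sketch below, using the official Kubernetes Python client (the controller's own scheduler code differs); the deis:deis-console ClusterRole and the deis-console-<app> ServiceAccount naming follow this PR, everything else is an assumption:

from kubernetes import client, config

def create_console_service_account(app_name: str) -> None:
    """Create the per-app ServiceAccount and bind the deis:deis-console ClusterRole to it."""
    config.load_incluster_config()  # the controller runs inside the cluster
    core = client.CoreV1Api()
    rbac = client.RbacAuthorizationV1Api()

    namespace = app_name                  # hephy apps live in a namespace named after the app
    sa_name = f"deis-console-{app_name}"  # naming follows the testadminapp3 example below

    # 1. ServiceAccount in the application's namespace; its token (secret)
    #    is cleaned up together with the namespace when the app is destroyed.
    core.create_namespaced_service_account(
        namespace=namespace,
        body={"metadata": {"name": sa_name}},
    )

    # 2. RoleBinding that attaches the cluster-wide deis:deis-console ClusterRole
    #    (pods/exec permission) to the ServiceAccount, scoped to this namespace only.
    rbac.create_namespaced_role_binding(
        namespace=namespace,
        body={
            "metadata": {"name": sa_name},
            "roleRef": {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "ClusterRole",
                "name": "deis:deis-console",
            },
            "subjects": [
                {"kind": "ServiceAccount", "name": sa_name, "namespace": namespace},
            ],
        },
    )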

The requirements are to set the K8S_API_ENDPOINT and CONTAINER_CONSOLE_ENABLED environment variables.

I tested it on my minikube with the following settings:

ip-10-0-0-119 ~ # minikube ip
10.0.0.119
ip-10-0-0-119 ~ # kubectl describe deployment deis-controller -n deis | grep "K8S_API_ENDPOINT\|CONTAINER_CONSOLE_ENABLED"
      CONTAINER_CONSOLE_ENABLED:             true
      K8S_API_ENDPOINT:                      https://10.0.0.119:8443

An application named testadminapp3 had a Service Account and token attached that gave me pods/exec access:

ip-10-0-0-119 ~ # kubectl auth can-i list pods --subresource=exec --as=system:serviceaccount:testadminapp3:deis-console-testadminapp3 --namespace=testadminapp3
yes

@Cryptophobia
Member

Cryptophobia commented Jul 15, 2021

@giannoul , thank you for the contribution!

Is there a way to make the tokens ephemeral, requested on demand, and expiring? Right now, the tokens will be created when the app is created and the SA/token remains there until the app/namespace is deleted. Can we create a route that will generate a console token for, let's say, 20-30 minute sessions?

https://github.com/teamhephy/controller/pull/148/files#diff-79aefc2bc0c74e445347a254ecfa621edf7e334f1c8a3efd8358008427d48fb3R236

Is it possible to set expiration on the JWT token using k8s tricks like:

expirationSeconds: 3600 #expires in 60 mins

https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume

This would mean hephy workflow cli requests a new token each time or controller uses a sort of cache for tokens. However, this may not be possible at this time.
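For reference, a hedged sketch of what requesting such a bound, expiring token could look like through the TokenRequest API with the official Kubernetes Python client (assuming a client version that exposes create_namespaced_service_account_token; this is not what the controller currently does):

from kubernetes import client, config

def request_short_lived_token(namespace: str, sa_name: str) -> str:
    """Ask the TokenRequest API for a token that expires on its own."""
    config.load_incluster_config()
    core = client.CoreV1Api()

    token_request = core.create_namespaced_service_account_token(
        name=sa_name,
        namespace=namespace,
        body={"spec": {"expirationSeconds": 3600}},  # expires in 60 mins
    )
    return token_request.status.token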

The other way this can be done is to store the service_account_name, service_account_creation_timestamp, and service_account_expiration_timestamp in the Django app's db model. Once the current time reaches service_account_expiration_timestamp we consider the SA token invalid and run a periodic cronjob in the controller to delete the tokens via the api. Then the on_delete method on the Django model will also delete the SA from the app's namespace by sending a call via the k8s client. The SA will only be created when a CLI call wants to create a console session. If the token exists, return it. If not, create it. If expired, recreate the SA.

https://stackoverflow.com/questions/11788821/best-way-to-delete-a-django-model-instance-after-a-certain-date
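A rough sketch of that bookkeeping as a Django model follows; the field names come from the description above, while the relation and lifetime are purely illustrative (the controller's actual schema may differ):

from datetime import timedelta

from django.db import models
from django.utils import timezone

class ConsoleServiceAccount(models.Model):
    # Illustrative relation; the real app model lives in the controller's api package.
    app = models.ForeignKey('api.App', on_delete=models.CASCADE)
    service_account_name = models.CharField(max_length=253)
    service_account_creation_timestamp = models.DateTimeField(auto_now_add=True)
    service_account_expiration_timestamp = models.DateTimeField(null=True)

    def save(self, *args, **kwargs):
        # Default lifetime of 30 minutes, matching the 20-30 min sessions mentioned above.
        if self.service_account_expiration_timestamp is None:
            self.service_account_expiration_timestamp = timezone.now() + timedelta(minutes=30)
        super().save(*args, **kwargs)

    def is_expired(self) -> bool:
        return timezone.now() >= self.service_account_expiration_timestamp

A periodic job would then filter on service_account_expiration_timestamp, delete the expired rows, and let the model's deletion hook remove the SA from the app's namespace through the k8s client.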

@giannoul
Author

@Cryptophobia Investigating the token expiration option I found the following:

1. An existing session is independent of the token

As we may see from the implementation, all users will use the same underlying Service Account in order to get console access to the container within the pod. Since the console access is a back-and-forth over websockets, the Service Account token is only used to initialize the connection. Experimenting with this, I saw that for an existing console session, even after I deleted the token, the connection was left intact. This means no disruptions for existing console connections when/if we remove the Service Account token (secret).

2. Service Account tokens are automatically recreated

If we delete the token (basically a k8s secret) it will get regenerated automatically. This seems to be the default behavior of the token-controller. During my testing I found that deleting the token via kubectl just leads to the creation of a new token with a different value and name suffix.

A way to implement the token expiration

Since:

  • the token deletion will not disrupt the existing console sessions
  • all users use the same token since they effectively utilize the single Service Account in the namespace
  • the token of the Service Account will be regenerated once deleted
  • the token is basically a k8s secret with the field creationTimestamp stating when it was created

we can regenerate the token upon a console session request and return the new token. This basically means that upon a new console request we check whether the existing token is older than e.g. 20 minutes; if so, we delete it (the k8s token-controller will create a new one) and return the new token to be used. Connections using the old token will continue to operate and new ones will use the new token.

The above approach is invalidation on demand rather than a cron-job-based action, but it has a lot fewer moving parts.
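A minimal sketch of that check-and-regenerate step, again using the official Kubernetes Python client and the legacy auto-generated token secrets described above (the controller's own client code differs):

import base64
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

MAX_TOKEN_AGE = timedelta(minutes=20)

def get_console_token(namespace: str, sa_name: str) -> str:
    """Return the SA token, recreating it first if it is older than MAX_TOKEN_AGE."""
    config.load_incluster_config()
    core = client.CoreV1Api()

    sa = core.read_namespaced_service_account(sa_name, namespace)
    secret_name = sa.secrets[0].name  # token secret generated by the token-controller
    secret = core.read_namespaced_secret(secret_name, namespace)

    age = datetime.now(timezone.utc) - secret.metadata.creation_timestamp
    if age > MAX_TOKEN_AGE:
        # Deleting the secret does not disrupt sessions that are already open;
        # the token-controller recreates a fresh token (new name suffix) automatically.
        core.delete_namespaced_secret(secret_name, namespace)
        # Re-read the ServiceAccount to pick up the regenerated secret name.
        # In practice a short wait/retry loop is needed here.
        sa = core.read_namespaced_service_account(sa_name, namespace)
        secret = core.read_namespaced_secret(sa.secrets[0].name, namespace)

    return base64.b64decode(secret.data["token"]).decode()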

@Cryptophobia
Member

Okay @giannoul, the above seems like a good solution for ephemeral tokens. Does that mean that a session created by an old token can stay open forever? We should also create a mechanism to break the session every 20 mins, forcing the client to retry with the token it was given when the session was created. This ensures that a long-running connection cannot use a very old token.

This is similar to how SSH long-running open connections are closed via timeouts.

If the above works then we should go with this new approach and add a timeout to make sure a long-running connection is not a security risk.

@giannoul
Author

The sessions opened by any token may remain open (idle) for as long as the kubelet parameter --streaming-connection-idle-timeout allows, which means a session can be idle for a maximum of ~4 hours by default.

The other end that could terminate the connection would be the gorilla websocket library, but it does not set such a timeout by default.

The above means that idle connections should be terminated after the time set by kubelet's --streaming-connection-idle-timeout.

In general, the session itself is controlled by Kubernetes and by the workflow-cli via gorilla websockets. This means that the hephy controller is not able to terminate anything; it is just the intermediary that passes the token to the workflow-cli.

So, the only way to avoid the security risk you mentioned would be to set a hard timeout on the workflow-cli, but I am afraid that the first thing that will be requested afterwards would be "how to increase the timeout" 🤣.

@Cryptophobia
Member

So, the only way to avoid the security risk you mentioned would be to set a hard timeout on the workflow-cli, but I am afraid that the first thing that will be requested afterwards would be "how to increase the timeout"

Adding a hard timeout on the workflow-cli client is easily bypassed and provides no security at all... so there would be no point in doing it for security anyway.

Is there any other way we can terminate the websocket on a set timeout using the controller's permissions? For example, a simple solution like setting an ENV var for WEBSOCKET_TIMEOUT on the controller. The controller would send that value when the workflow-cli client sets up the websocket connection. Then, immediately after the token is returned, the controller would recreate it: if a request for a token comes in, wait a short period (say 20 seconds) and then recreate the token. That way the session is guaranteed to time out and the token is no longer valid after establishing this session; a token is only valid for a single session.

This will ensure that if a user is deleted, their access is gone as soon as the session expires. Still not perfect, but the token will be recreated each time.

@Cryptophobia
Member

@giannoul , any update on this? I would like to get out a new minor release of hephy soon. If you are still working on this one we can get it out for next release.

@giannoul
Author

I didn't get the chance to investigate your suggestion due to a very busy schedule these days. Please proceed with the minor release without this one.

@Cryptophobia
Member

Okay, thank you for getting back so soon. I can help push this forward for the subsequent release after next.

@giannoul
Author

giannoul commented Oct 5, 2021

The Keep-Alive header cannot be set for the websocket connection. In order to implement the functionality we discussed, I did the following:

  • on the controller, I added a parameter (env variable) named CONTAINER_CONSOLE_WEBSOCKET_TIMEOUT, which is the timeout in seconds. This parameter is sent to the workflow-cli along with the token when a user requests access to a pod.
  • on the controller, I added a custom ResponseWithCallback class that sends the requested body and then executes a callback function. In our case that callback is the re-creation of the token, which means that immediately after the token is received, it gets re-created.
  • on the workflow-cli, I added a context with a timeout that closes the channel after the timeout expires.
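For illustration, one plausible shape for such a ResponseWithCallback in Django is sketched below, relying on the handler calling close() after the body has been sent; the PR's actual implementation may differ, and recreate_console_token is a hypothetical helper (the timeout name follows the CONTAINER_CONSOLE_WEBSOCKET_TIMEOUT variable above):

from django.http import JsonResponse

class ResponseWithCallback(JsonResponse):
    """JSON response that runs a callback after the body has been delivered."""

    def __init__(self, data, callback=None, **kwargs):
        super().__init__(data, **kwargs)
        self._callback = callback

    def close(self):
        # Django's handler calls close() once the response has been sent, so the
        # token handed out in the body is immediately re-created and can only be
        # used for the session that was just established.
        super().close()
        if self._callback is not None:
            self._callback()

# Illustrative usage in a view:
# return ResponseWithCallback(
#     {"token": token, "timeout": settings.CONTAINER_CONSOLE_WEBSOCKET_TIMEOUT},
#     callback=lambda: recreate_console_token(app),
# )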

@Cryptophobia
Member

Awesome work @giannoul! Thank you for working with me on a somewhat more secure design. We will get this in right after the upcoming patch release v2.23.1.
