Upgrade keycloak to version 24 #3796

Merged: 6 commits into main on Sep 4, 2024
Conversation

@rocketeerbkw (Member) commented on Aug 22, 2024

General Checklist

  • Affected Issues have been mentioned in the Closing issues section
  • Documentation has been written/updated
  • PR title is ready for inclusion in changelog

Database Migrations

  • If your PR contains a database migration, it MUST be the latest in date order alphabetically

We've been on Keycloak 21 for a while; this PR upgrades us to 24. A number of API changes have been made, so there is a refactor to account for those. I was also able to get rid of our keycloak-admin-client fork.

There are no migrations needed other than the normal keycloak upgrade process.

@shreddedbacon (Member)

The remoteshell test fails because the sshportal calls to Keycloak to get the group information fail.

{"time":"2024-08-23T02:14:26.766545934Z","level":"ERROR","msg":"couldn't generate group name to project IDs map","query":{"sshFingerprint":"SHA256:kDhWiCrizJSFOAPdDcpVIOV3W9f2VlwtjUrDUudJgTg","namespaceName":"ci-features-control-k8s-remoteshell","projectID":181,"environmentID":37,"sessionID":"6e63bce76a21589a6942e50c58df156ce75ea69a664e2a7729013d303996c2bd"},"error":"couldn't get group ID for group: ci-group"}

Since sshportal handles all of the group/access/permission logic independently of the API, it will likely also need updates to support the sparse group queries against Keycloak, or it could just use the API.

@smlx (Member) commented on Aug 23, 2024

The problem turned out to have nothing really to do with ssh-portal; it's just an incompatibility with Keycloak permissions in newer versions. Details and a backward-compatible fix here: #3797

I got the CI run to pass by logging in to Keycloak in CI and adding the view-users permission manually. So once this is rebased on #3797 it should go green.

@rocketeerbkw marked this pull request as ready for review on August 28, 2024 at 21:50
@shreddedbacon (Member) left a comment

I've checked over this and had a play with it locally both in k3d/local-stack and docker compose. I wasn't able to detect anything wrong. The changes seem to be pretty straightforward.

I do have a question about the choice not to use Redis for the sparse queries, given past experiences with Keycloak performance issues. Maybe it will be fine now with sparse groups, though. I guess we can find out in the test instance after merging.

The only issue I found was during local testing in docker compose: the API hangs waiting for Keycloak here. Restarting the API container locally allows it to succeed, but by the time the API is restarted the seeding process has already failed and needs to be re-run.

It's fine in Kubernetes because the liveness probe on the API pod eventually fails and causes it to restart, which has the same effect as restarting the container locally.

I tried adding a jank timeout like this (a rough sketch of the idea follows), which gives up on the first attempt after a few seconds; that's enough to ensure the next or subsequent attempts succeed. This works, but there is probably a better way to do it.
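For illustration only, a minimal sketch of that kind of timeout wrapper, assuming Node 18+ with a global fetch; withTimeout, waitForKeycloak, and the use of KEYCLOAK_URL here are illustrative names, not the actual Lagoon code:

```typescript
// Sketch only: cap each connection attempt with Promise.race so a hung
// request can't block API startup indefinitely. Assumes Node 18+ global fetch;
// names are illustrative, not the actual Lagoon implementation.
const withTimeout = <T>(promise: Promise<T>, ms: number): Promise<T> =>
  Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ]);

const waitForKeycloak = async (): Promise<void> => {
  for (;;) {
    try {
      // Give up on this attempt after 5s and fall through to a retry.
      await withTimeout(fetch(`${process.env.KEYCLOAK_URL}`), 5000);
      return;
    } catch (err) {
      console.warn('Keycloak not reachable yet, retrying in 1s:', err);
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }
};
```

Note that Promise.race only abandons the attempt; it does not cancel the underlying request, which is part of why this counts as jank.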

Review threads on services/api/src/models/group.ts (resolved)
@shreddedbacon added this to the 2.21.0 milestone on Sep 3, 2024
@rocketeerbkw (Member, Author)

(Quoting @shreddedbacon's review comment above about the API hanging waiting for Keycloak during local docker compose testing.)

This is a weird one. I traced the error down to a fetch call that appears to crash Node.js; it doesn't matter whether it's async/await or a promise chain, it can't be caught by try/catch. For me it appears to be related to the 172.17.0.1 IP. If you remove the KEYCLOAK_URL env vars in docker-compose.yml, does that fix it for you? If so, then it's probably a local-only issue and not something that would break in a real deployment.

@shreddedbacon (Member) commented on Sep 4, 2024

(Quoting @rocketeerbkw's reply above, asking whether removing the KEYCLOAK_URL env vars in docker-compose.yml fixes the hang.)

I'll have another look real quick. But I don't understand why it fails with that IP on the first attempt, yet succeeds after a restart. Either way, a hang like this with no timeout seems like a bad thing to have.

The jank timeout wrapper I tried didn't require any manual intervention or changes to the docker-compose file. So I'm keen to understand the fetch crash a bit more, and why it isn't captured by the try in that waitForKeycloak code.
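For comparison, a hedged sketch of capping the fetch itself with AbortSignal.timeout (available in Node 18+): unlike a Promise.race wrapper, this aborts the underlying request, so a stalled connection surfaces as a catchable error rather than an indefinite hang. The function name and probe URL are illustrative, not the actual waitForKeycloak code:

```typescript
// Sketch only: bound the request itself rather than racing a timer.
// Assumes Node 18+ global fetch with AbortSignal.timeout support.
const probeKeycloak = async (baseUrl: string): Promise<boolean> => {
  try {
    const res = await fetch(baseUrl, { signal: AbortSignal.timeout(5000) });
    return res.ok;
  } catch (err) {
    // A timed-out or aborted request lands here instead of hanging the caller,
    // so a surrounding retry loop keeps control.
    console.warn('Keycloak probe failed:', err);
    return false;
  }
};
```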

@shreddedbacon (Member)

Yeah, removing KEYCLOAK_URL in the api service does indeed work, so I guess let's just do that for now.

But it would still be worth figuring out whether this is just a weird local edge case or whether it could happen in a running pod. It looks like port 3000 is inaccessible when the condition occurs, so in Kubernetes the pod will just go into a crash loop. If it keeps hitting that bug, the API could be completely unusable.

Maybe I'm overreacting though 🙂

@shreddedbacon (Member) left a comment

Good for now, we can evaluate performance issues in the test instance as required.

@tobybellwood merged commit 43b005e into main on Sep 4, 2024
1 check passed
@tobybellwood deleted the keycloak-upgrade branch on September 4, 2024 at 23:12