-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[auth] Allow use of cloud access tokens for hail batch authorization #13131
Conversation
6b4be59
to
08b2eec
Compare
8ca6896
to
e231540
Compare
@danking This is far from done, evidenced by the fact that for some reason I'm getting tons of QoB test failures (but some are passing! :/) but I would appreciate a first pass on this if you want to do a high-level review. I do have a rough RFC that is not up to date, so let me know if you'd prefer to start the discussion from such a doc instead of the code. |
40282f5
to
b0dd11e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all seems broadly what I expected. I didn't carefully review the auth flow yet.
The idea is that we're either using:
- latent credentials like google application default or the azure managed identity.
- An OAuth2 flow to get authorization to use the user's email/openid.
In either case, we get an OAuth2 access token for email or openid, we send that to Hail which uses it to verify that the caller produced a valid email/openid access token. We accept that as proof of identity. The user sends that along with every request as proof it is still logged in. If the token expires, the client requests a new one and sends that along in the next request.
We are gonna use either a "hail" audience or create our own scopes to ensure that Google doesn't let arbitrary third parties steal these tokens and log into, say, Terra.
Does that all check out?
hail/scripts/upload_qob_jar.sh
Outdated
@@ -22,5 +22,6 @@ else | |||
JAR_LOCATION="${TEST_STORAGE_URI}/${NAMESPACE}/jars/${TOKEN}/${REVISION}.jar" | |||
fi | |||
|
|||
python3 -m hailtop.aiotools.copy -vvv 'null' '[{"from":"'${SHADOW_JAR}'", "to":"'${JAR_LOCATION}'"}]' | |||
gcloud storage cp ${SHADOW_JAR} ${JAR_LOCATION} | |||
# python3 -m hailtop.aiotools.copy -vvv 'null' '[{"from":"'${SHADOW_JAR}'", "to":"'${JAR_LOCATION}'"}]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:/. Timeout issues, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh ya.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rm -rf and please PR something to back off the default timeout so we can feel confident in copy.
.createScoped( | ||
"openid", | ||
"https://www.googleapis.com/auth/userinfo.email", | ||
"https://www.googleapis.com/auth/userinfo.profile" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need profile or openid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I can try to remove them.
override def accessToken(): String = { | ||
val context = new TokenRequestContext() | ||
// TODO I hope I dont have to give this scope | ||
context.addScopes("https://management.azure.com/.default") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openid should be enough no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya that's why I have the TODO, going to make sure I can trim that down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually replaced by the hail-specific OAuth scope.
infra/azure/modules/auth/main.tf
Outdated
oauth2_permission_scope { | ||
admin_consent_description = "Allow hailctl to access the Hail Batch service on behalf of the signed-in user." | ||
admin_consent_display_name = "hailctl" | ||
user_consent_description = "Allow hailctl to access the Hail Batch service on your behalf." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also the hail Python library right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm ya I can change the text here to the Hail Python Library.
build.yaml
Outdated
if [ "$?" -ne 0 ] | ||
then | ||
kubectl delete secret -n {{ default_ns.name }} ssl-config-hail-root | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unrelated updates of the root cert, yeah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya this was merged in a different PR already. Dropped this commit.
'https://www.googleapis.com/auth/userinfo.email', | ||
'https://www.googleapis.com/auth/cloud-platform', | ||
'https://www.googleapis.com/auth/appengine.admin', | ||
'https://www.googleapis.com/auth/compute', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can drop the last three, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to change the defaults in this PR, as these are used for other purposes like the FS which definitely needs at least GCS scopes. I think as a follow-up we should get rid of default scopes in general and require use sites to specify the min set of scopes they ned.
78bec4b
to
f832b4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dismiss if ready for another look
86e00b3
to
0e1a34a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent changes. Some conceptual (but, I guess, not RFC-level, more code-level concepts) things we need to iron out and get written down. A few nits. Overall feels good!
auth/auth/auth.py
Outdated
@@ -518,6 +518,12 @@ async def rest_login(request: web.Request) -> web.Response: | |||
) | |||
|
|||
|
|||
@routes.get('/api/v1alpha/oauth2-client') | |||
async def hailctl_oauth_client(request): # pylint: disable=unused-argument | |||
idp = 'Google' if CLOUD == 'gcp' else 'Microsoft' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: I don't love that gcp
is a magic string for Google, but it is what is for now.
Shouldn't you use IdentityProvider.GOOGLE.value
and IdentityProvider.MICROSOFT.value
for the idp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right, I'll use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump, the literal strings are still here.
auth/auth/auth.py
Outdated
raise web.HTTPUnauthorized() | ||
|
||
|
||
async def get_userinfo_from_login_id_or_hail_identity_id(request, login_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to only be used with a IdP's uid, right? I think I'm a bit confused on hail identity uid vs login id. The login id is the username from the email, right? And hail identity uid is the uid from the cloud provider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
login_id
is what we use to identify human users. In GCP it is currently an email, in Azure it is the actual sub
of that user in AAD.
hail_identity
is the email in GCP or common name in Azure of the robot identity assigned to the user. hail_identity_uid
is the actual sub
of the robot identity in both of those places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to only be used with a IdP's uid, right
This is not the case because we do not store UIDs for human users in GCP, only their email.
review is in progress, sorry for the delay. |
auth/auth/auth.py
Outdated
@@ -518,6 +518,12 @@ async def rest_login(request: web.Request) -> web.Response: | |||
) | |||
|
|||
|
|||
@routes.get('/api/v1alpha/oauth2-client') | |||
async def hailctl_oauth_client(request): # pylint: disable=unused-argument | |||
idp = 'Google' if CLOUD == 'gcp' else 'Microsoft' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump, the literal strings are still here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not finished with review, but here's what I have
@@ -58,12 +58,53 @@ resource "azuread_application_password" "oauth2" { | |||
application_object_id = azuread_application.oauth2.object_id | |||
} | |||
|
|||
resource "random_uuid" "hailctl_oauth2_idenfier_uri_id" {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to apply this before we merge, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya. This is one of the annoying papercuts I was talking about. Unless we stack them, this PR and the ubuntu PR cannot be passing at the same time because they have different terraform changes.
I think we're very close; lots of little details of course, but we're close. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A big but important and necessary change.
Great work.
@@ -205,6 +206,7 @@ def create_vm_config( | |||
DOCKER_ROOT_IMAGE=$(jq -r '.docker_root_image' userdata) | |||
DOCKER_PREFIX=$(jq -r '.docker_prefix' userdata) | |||
REGION=$(jq -r '.region' userdata) | |||
HAIL_AZURE_OAUTH_SCOPE=$(jq -r '.hail_azure_oauth_scope' userdata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder of this before my approval.
I put WIP on in light of the infra ordering issue. |
Deprecate hail-minted API keys in favor of using access tokens from the identity providers already associated with user identities. For more context and a high-level overview of the implementation, see this RFC