
Authentication and service account plan for Pipeline + Kubeflow #374

Closed
IronPan opened this issue Nov 26, 2018 · 5 comments

Comments

@IronPan
Member

IronPan commented Nov 26, 2018

Currently, in the Pipeline instructions, we deploy the cluster with the cloud-platform scope. Without any configuration, the pipeline and any derived Argo jobs will run under the default service account (the default Compute Engine service account). This unrestricted setup was acceptable in the past for development and testing purposes, but as the project goes public, we should move the service account setup to a managed approach.

Kubeflow uses different service accounts for different roles:

  • kf admin: manages the Kubeflow cluster, e.g. networking and the kf deployment config.
  • kf user: has access to various GCP APIs such as GCS and BigQuery, for various ML jobs.

For ML Pipelines, the Argo job, as well as any derived workload such as a TF-Job, should ideally run under the kf-user service account in order to access GCP APIs. To achieve this, here are the TODO items needed across various parts of the pipeline.

  1. The Pipeline System needs to mount the kf-user service account key into each pod of the Argo job and set the GOOGLE_APPLICATION_CREDENTIALS environment variable. Kubeflow stores the service account key as a K8s secret in the cluster, e.g.:
```yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: gcloud
  namespace: kubeflow
spec:
  containers:
  - image: google/cloud-sdk
    name: gcloud
    command: [sleep]
    args: ["10000000"]
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/secrets/user-gcp-sa.json"
    volumeMounts:
    - name: sa
      mountPath: "/etc/secrets"
      readOnly: true
  volumes:
  - name: sa
    secret:
      secretName: user-gcp-sa
```
  2. The service account needs to be activated inside the pod using gcloud auth activate-service-account before any GCP API call. Please see here for an example. (A DSL sketch covering items 1 and 2 follows this list.)
  3. For the tf-job samples, the scheduler needs to do the same and mount the service account into the TF-Job; otherwise the TF-Job can't write output to GCS. E.g. here
  4. For other places where we create pods that read data from GCS, we need to do the same, e.g. TensorBoard.
  5. The Kaniko build, which is used by the notebook to build images, also needs to be part of this effort, since it touches the container registry.
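
For reference, here is a minimal sketch of how a pipeline step could do items 1 and 2 through the KFP Python DSL, assuming an SDK version where ContainerOp exposes add_volume / add_volume_mount / add_env_variable directly; the step name, image command, and bucket are only illustrative.

```python
from kubernetes import client as k8s_client
import kfp.dsl as dsl


@dsl.pipeline(name='gcp-auth-example', description='Illustrative only')
def gcp_auth_pipeline():
    # A step that needs GCP access; the command and bucket are placeholders.
    op = dsl.ContainerOp(
        name='copy-to-gcs',
        image='google/cloud-sdk',
        command=['sh', '-c'],
        arguments=[
            # Activate the mounted key before calling any GCP API (item 2).
            'gcloud auth activate-service-account '
            '--key-file=$GOOGLE_APPLICATION_CREDENTIALS && '
            'gsutil cp /tmp/out.txt gs://my-bucket/out.txt'
        ],
    )
    # Mount the user-gcp-sa secret and point GOOGLE_APPLICATION_CREDENTIALS
    # at the key file, mirroring the pod spec above (item 1).
    op.add_volume(k8s_client.V1Volume(
        name='sa',
        secret=k8s_client.V1SecretVolumeSource(secret_name='user-gcp-sa')))
    op.add_volume_mount(k8s_client.V1VolumeMount(
        name='sa', mount_path='/etc/secrets', read_only=True))
    op.add_env_variable(k8s_client.V1EnvVar(
        name='GOOGLE_APPLICATION_CREDENTIALS',
        value='/etc/secrets/user-gcp-sa.json'))
```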
@IronPan
Member Author

IronPan commented Nov 26, 2018

Update:
TensorBoard fixed #273
Kaniko fixed #343
DSL supports volume and env APIs now #300

TODO:
Update the currently released samples to use the strongly typed GCP op #314
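
For context, a minimal sketch of what the strongly typed approach looks like with the SDK's GCP helper, assuming kfp.gcp.use_gcp_secret is available in the installed kfp version and that a secret named user-gcp-sa exists in the cluster; the image, command, and bucket are hypothetical:

```python
import kfp.dsl as dsl
import kfp.gcp as gcp


@dsl.pipeline(name='sample-with-gcp-secret', description='Illustrative only')
def sample_pipeline():
    # Any ContainerOp that talks to GCS/BQ; image and arguments are placeholders.
    train = dsl.ContainerOp(
        name='train',
        image='gcr.io/my-project/trainer:latest',   # hypothetical image
        command=['python', '/app/train.py'],
        arguments=['--output', 'gs://my-bucket/model'],  # hypothetical bucket
    )
    # Mounts the user-gcp-sa secret and sets GOOGLE_APPLICATION_CREDENTIALS
    # so the step runs under the kf-user service account, replacing the
    # manual volume/env wiring shown in the first comment.
    train.apply(gcp.use_gcp_secret('user-gcp-sa'))
```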

@jlewi
Contributor

jlewi commented Dec 3, 2018

@IronPan what is the remaining work here?

@IronPan
Member Author

IronPan commented Dec 3, 2018

@Ark-kun How is the work going on migrating the samples to use the GCP credential?

@jlewi
Contributor

jlewi commented Dec 17, 2018

@Ark-kun @IronPan Any update on this issue?

@IronPan
Member Author

IronPan commented Dec 31, 2018

The samples are now updated to use the right permissions.

@IronPan IronPan closed this as completed Dec 31, 2018
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024