This repository has been archived by the owner on Aug 10, 2023. It is now read-only.

tutorial to trigger dataflow jobs using cloud scheduler #1396

zhongchen
Contributor

Port the Medium tutorial to the community.

https://jira.gcp.solutions/browse/PUB-2764

@google-cla google-cla bot added the cla: yes label Aug 11, 2020
@ToddKopriva ToddKopriva self-requested a review August 11, 2020 17:44
@ToddKopriva ToddKopriva self-assigned this Aug 11, 2020
@ToddKopriva
Member

The CircleCI tests are failing because the document is missing the frontmatter that it requires.

To see what you need to include at the top, see a published document, like this one:
https://github.com/GoogleCloudPlatform/community/edit/master/tutorials/embedded-c-getting-started/index.md

Member

@jpatokal jpatokal left a comment

A few suggestions for adding detail. In step-by-step tutorials like this, it's good to be as explicit as possible and list out even steps that might be "obvious".

![Set up your cloud scheduler](set_up_the_cloud_scheduler.png)


If you use Terraform, here is one example to define a scheduler.
Member

Tutorials usually avoid giving multiple paths. Should the user use the Console (above), or should they use Terraform? If TF, please give the command they would use to deploy this into their project.

Contributor Author

Sure. I will stick with the Terraform solution here.

Member

@jpatokal jpatokal Aug 20, 2020

I'm OK with the Terraform approach, but there's a lot of setup involved which I think we need to cover.

I've forked your branch, templatized the TF a bit more and added the basic steps here:
https://github.com/jpatokal/community/tree/zhong-cloud-scheduler-dataflow-tutorial

Specific commits:
jpatokal@488f91e
5136fd2 d268bba

I don't have the right to commit directly into this PR, but maybe you can pull the commit above from my repo?

Contributor Author

Thanks! I have merged your changes into my branch.

@ToddKopriva
Member

Thanks for the review, @jpatokal.

@zhongchen, I'll wait until you have resolved these review comments before I begin the editorial and production review.

@ToddKopriva
Member

@jpatokal, let me know when the review comments are resolved to your satisfaction.

@zhongchen
Contributor Author

@ToddKopriva Can you start providing some feedback as well?

@jpatokal
Member

@zhongchen Apologies for the delay; the notifications for this went to my personal account (oops). See comments above.


  http_target {
    http_method = "POST"
    uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
Member

@jpatokal jpatokal Aug 20, 2020

I'm getting 403 PERMISSION_DENIED from Scheduler when invoking this? (URL updated to my own bucket, of course.) There's a Dataflow Step INFO log for "dataflow.jobs.create" with authorizationInfo: granted: true, which would imply that DF is OK, but no further logs.
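For anyone reproducing this outside Terraform, here is a minimal Python sketch that assembles the same templates:launch URL from a project ID, a region, and a template path. The function name and the example values are hypothetical and not part of this PR; the sketch only builds the string and does not call Dataflow or deal with authentication.

```python
from urllib.parse import urlencode

# Base endpoint taken from the uri in the Terraform snippet under review.
DATAFLOW_ENDPOINT = "https://dataflow.googleapis.com/v1b3"


def build_launch_url(project_id: str, region: str, template_gcs_path: str) -> str:
    """Return the templates:launch URL for a template stored in Cloud Storage."""
    if not template_gcs_path.startswith("gs://"):
        raise ValueError("template_gcs_path must start with gs://")
    # Leave ':' and '/' unescaped so the result has the same shape as the
    # uri in the Terraform snippet.
    query = urlencode({"gcsPath": template_gcs_path}, safe=":/")
    return (
        f"{DATAFLOW_ENDPOINT}/projects/{project_id}"
        f"/locations/{region}/templates:launch?{query}"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own project, region, and bucket.
    print(build_launch_url(
        "my-project", "us-central1",
        "gs://my-bucket/templates/dataflow-demo-template",
    ))
```

Printing the URL makes it easy to compare against what the scheduler job is configured to call when debugging a failed invocation.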

Contributor Author

I updated the doc to use Terraform to create the SA and set its permissions. I ran the step in the Cloud Console and it worked for me.

Member

Tested, and the Cloud Build SA cannot create/modify/act as other SAs or schedule jobs by default. Please direct the user to https://cloud.google.com/cloud-build/docs/securing-builds/configure-access-for-cloud-build-service-account#granting_a_role_using_the_iam_page and ask them to add these three roles:

  1. Service Accounts > Service Account Admin
  2. Service Accounts > Service Account User
  3. Cloud Scheduler > Cloud Scheduler Admin

But even after granting all these, assigning the extra roles to the successfully created new SA fails:

Step #0 - "Terraform init": Error: Batch "iam-project-hybrid-prom modifyIamPolicy" for request "Create IAM Members roles/dataflow.admin serviceAccount:scheduler-dataflow-demo@hybrid-prom.iam.gserviceaccount.com for "project \"hybrid-prom\""" returned error: Error retrieving IAM policy for project "hybrid-prom": googleapi: Error 403: The caller does not have permission, forbidden

...which makes no sense to me since roles/iam.serviceAccountAdmin includes iam.serviceAccounts.getIamPolicy. terraform graph also shows that the dependency is correctly recognized and the modifications are made only after the SA exists:

"[root] google_project_iam_member.cloud-scheduler-dataflow" -> "[root] google_service_account.cloud-scheduler-demo"
"[root] google_project_iam_member.cloud-scheduler-gcs" -> "[root] google_service_account.cloud-scheduler-demo"

Step #0 - "Terraform init": google_service_account.cloud-scheduler-demo: Creating...
Step #0 - "Terraform init": google_service_account.cloud-scheduler-demo: Creation complete after 2s [id=projects/hybrid-prom/serviceAccounts/scheduler-dataflow-demo@hybrid-prom.iam.gserviceaccount.com]
Step #0 - "Terraform init": google_project_iam_member.cloud-scheduler-dataflow: Creating...
Step #0 - "Terraform init": google_project_iam_member.cloud-scheduler-gcs: Creating...
[errors]

What am I missing? Any ideas?

Contributor Author

  • Cloud Scheduler Admin
  • Dataflow Admin
  • Service Account User
  • Project IAM Admin

Maybe you are missing the Project IAM Admin role, which is needed to manage IAM bindings? I reduced the roles to the four above and it worked for me.

Can you try it as well?

Member

A-ha! Project IAM Admin was the missing role, it works now!

Two more tweaks:

  • You need Service Account Admin to create the new SA
  • You don't need Dataflow Admin (because it's the new SA that kicks off the job)
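Putting the thread together, the Cloud Build SA ends up needing four roles: Cloud Scheduler Admin, Service Account Admin, Service Account User, and Project IAM Admin. As a rough sketch, the Python below prints the gcloud commands that would grant them. Only roles/iam.serviceAccountAdmin is quoted in this thread; the other role IDs and the PROJECT_NUMBER@cloudbuild.gserviceaccount.com address are assumptions to verify against your project, and the helper name is hypothetical.

```python
import shlex

# Display name -> role ID. Only roles/iam.serviceAccountAdmin appears in the
# thread; the other IDs are assumed to match the named roles.
CLOUD_BUILD_ROLES = {
    "Cloud Scheduler Admin": "roles/cloudscheduler.admin",
    "Service Account Admin": "roles/iam.serviceAccountAdmin",
    "Service Account User": "roles/iam.serviceAccountUser",
    "Project IAM Admin": "roles/resourcemanager.projectIamAdmin",
}


def binding_commands(project_id: str, project_number: str) -> list:
    """Return one gcloud command per role for the Cloud Build SA."""
    # Assumed default Cloud Build SA address; check the IAM page of your project.
    member = f"serviceAccount:{project_number}@cloudbuild.gserviceaccount.com"
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role}",
        ])
        for role in CLOUD_BUILD_ROLES.values()
    ]


if __name__ == "__main__":
    # Placeholder project ID and number.
    for command in binding_commands("my-project", "123456789"):
        print(command)
```

The commands are printed rather than executed so that they can be reviewed before anything in the project's IAM policy changes.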

Also, in the gcloud builds command you need to use $BUCKET_NAME (not $BUCKET), otherwise it tries to access gs://gs://bucket.

Contributor Author

I actually removed bucket_name completely. Now bucket holds the bare bucket name, and in the scripts I have added back the gs:// prefix for all the bucket paths.
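The gs://gs://bucket problem comes from adding the prefix to a value that already has it. A minimal Python sketch of the convention described here (store the bare name, add the prefix in one place) follows; the helper name is hypothetical and the tutorial's scripts may handle this differently.

```python
def to_gcs_uri(bucket: str, object_path: str = "") -> str:
    """Build a gs:// URI from a bucket name, tolerating an existing prefix."""
    name = bucket.strip()
    # Drop the prefix if the caller already included it, so it is never doubled.
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.strip("/")
    if not name:
        raise ValueError("bucket name is empty")
    uri = f"gs://{name}"
    if object_path:
        uri += "/" + object_path.lstrip("/")
    return uri


if __name__ == "__main__":
    # Both spellings produce the same URI.
    print(to_gcs_uri("my-bucket", "templates/dataflow-demo-template"))
    print(to_gcs_uri("gs://my-bucket", "templates/dataflow-demo-template"))
```

Funneling every path through one helper means a prefixed and an unprefixed bucket value give the same result.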

@jpatokal
Member

Working now! Zhong, please add in the last two tweaks, and @ToddKopriva, you'll be good to go.

@ToddKopriva
Member

Thanks, @jpatokal.

@zhongchen, I'll wait until you have made the suggested change, and then I'll begin the editorial and production process.

zhong added 2 commits August 25, 2020 09:51
…zhongchen/community into zhong-cloud-scheduler-dataflow-tutorial
@zhongchen
Contributor Author

@ToddKopriva I have addressed all the comments.

@ToddKopriva ToddKopriva merged commit 7249510 into GoogleCloudPlatform:master Aug 31, 2020
@zhongchen zhongchen deleted the zhong-cloud-scheduler-dataflow-tutorial branch August 31, 2020 20:25
xiangshen-dk pushed a commit to xiangshen-dk/community that referenced this pull request Jan 24, 2022
…latform#1396)

* tutorial to trigger dataflow jobs using cloud scheduler

* format the tutorial to fix circle ci checks

* change the title format

* address comments

* Templating and step-by-step instructions

* Enable APIs

* Add template compilation

* add the architecture diagram

* rename the build script

* address comments

* minor fixes

* address comments

* add cloudbuild sa setup

* add project iam admin role

* add dummy logic for dataflow job

* update sa setup

* first edit pass during readthrough

* second edit pass

Co-authored-by: zhong <zhongchen@google.com>
Co-authored-by: Todd Kopriva <43478937+ToddKopriva@users.noreply.github.com>
Co-authored-by: Jani Patokallio <jani@google.com>