This repository has been archived by the owner on Aug 10, 2023. It is now read-only.

tutorial to trigger dataflow jobs using cloud scheduler #1396

Merged
Changes from 7 commits

Commits (25)
68852e2
tutorial to trigger dataflow jobs using cloud scheduler
Aug 11, 2020
903e5c3
format the tutorial to fix circle ci checks
Aug 11, 2020
f6bd1bb
change the title format
Aug 11, 2020
126b76e
address comments
Aug 12, 2020
5407372
Merge branch 'master' into zhong-cloud-scheduler-dataflow-tutorial
ToddKopriva Aug 14, 2020
e62b1f7
Merge branch 'master' into zhong-cloud-scheduler-dataflow-tutorial
ToddKopriva Aug 16, 2020
488f91e
Templating and step-by-step instructions
jpatokal Aug 20, 2020
5136fd2
Enable APIs
jpatokal Aug 20, 2020
d268bba
Add template compilation
jpatokal Aug 20, 2020
c46d0b2
add the architecture diagram
Aug 23, 2020
168295c
rename the build script
Aug 23, 2020
e174675
address comments
Aug 24, 2020
3752e16
minor fixes
Aug 24, 2020
ff0ec4f
address comments
Aug 24, 2020
144c1cf
add cloudbuild sa setup
Aug 24, 2020
eaa5d01
add project iam admin role
Aug 24, 2020
7090f9b
Merge branch 'master' into zhong-cloud-scheduler-dataflow-tutorial
ToddKopriva Aug 24, 2020
f28ec2b
add dummy logic for dataflow job
Aug 24, 2020
c210975
Merge branch 'zhong-cloud-scheduler-dataflow-tutorial' of github.com:…
Aug 24, 2020
429e060
Merge branch 'master' into zhong-cloud-scheduler-dataflow-tutorial
ToddKopriva Aug 25, 2020
3135df0
update sa setup
Aug 25, 2020
0c784f6
Merge branch 'zhong-cloud-scheduler-dataflow-tutorial' of github.com:…
Aug 25, 2020
af1b6d6
first edit pass during readthrough
ToddKopriva Aug 31, 2020
ebddd2d
second edit pass
ToddKopriva Aug 31, 2020
52c8cf1
Merge branch 'master' into zhong-cloud-scheduler-dataflow-tutorial
ToddKopriva Aug 31, 2020
60 changes: 55 additions & 5 deletions tutorials/schedule-dataflow-jobs-with-cloud-scheduler/index.md
@@ -16,7 +16,10 @@ If you don't explicitly set the location in the request, the jobs will be created

In this tutorial, you will learn how to set up a [Cloud Scheduler](https://cloud.google.com/scheduler/) job to trigger to your
Dataflow batch jobs.
You can find all the code [here](./scheduler-dataflow-demo).


Here is a high-level architecture diagram, and you can find all the code [here](./scheduler-dataflow-demo).
![diagram](scheduler-dataflow-diagram.png)

[Cloud Dataflow](https://cloud.google.com/dataflow) is a managed service for handling
both streaming and batch jobs. Streaming jobs need to be launched only once, and you don't have to operate them afterwards.
@@ -42,7 +45,7 @@ resource "google_cloud_scheduler_job" "scheduler" {

http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://${var.bucket}/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}
@@ -57,7 +60,7 @@ resource "google_cloud_scheduler_job" "scheduler" {
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://zhong-gcp/temp",
"tempLocation": "gs://${var.bucket}/temp",
"zone": "us-west1-a"
}
}
@@ -67,7 +70,54 @@ EOT
}
```
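The `templates:launch` URI in the snippet above is assembled from the project, region, and bucket variables. As a minimal sketch (the values below are placeholders, not from the tutorial), the expansion works like this:

```shell
# Placeholder values -- substitute your own project, region, and bucket.
PROJECT_ID=my-project
REGION=us-central1
BUCKET=my-demo-bucket

# This is the endpoint Cloud Scheduler POSTs to on every tick of the schedule.
URI="https://dataflow.googleapis.com/v1b3/projects/${PROJECT_ID}/locations/${REGION}/templates:launch?gcsPath=gs://${BUCKET}/templates/dataflow-demo-template"
echo "$URI"
```

Cloud Scheduler sends its POST to this endpoint with an OAuth token for the service account, which is what launches the Dataflow job on schedule.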

Afterwards you are all set up!
## Instructions

Follow these step-by-step instructions to create a sample Dataflow pipeline with Cloud Build.

First, open Cloud Shell and clone the repository.

```
git clone https://github.com/GoogleCloudPlatform/community
cd community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler/scheduler-dataflow-demo/
```

Create a Cloud Storage bucket, which will be used to store Terraform state and Dataflow templates.
Replace `[BUCKET_NAME]` and `[PROJECT_ID]` with your own values.
You can skip this step if you already have a GCS bucket.

```
export BUCKET_NAME=[BUCKET_NAME]
export BUCKET=gs://${BUCKET_NAME}
gsutil mb -p [PROJECT_ID] ${BUCKET}
```
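As a quick sanity check (a sketch with a placeholder name), the two exports above simply compose the `gs://` URI that `gsutil mb` receives:

```shell
BUCKET_NAME=my-demo-bucket   # placeholder -- use your own globally unique name
BUCKET=gs://${BUCKET_NAME}
echo "$BUCKET"
# After creating the bucket, you can verify it exists with: gsutil ls -b "$BUCKET"
```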

The next step is to create a backend for Terraform to store the state of your
GCP resources. Run the command below to configure a remote backend that uses GCS as the storage.

```
cd terraform
cat > backend.tf << EOF
terraform {
backend "gcs" {
bucket = "${BUCKET_NAME}"
prefix = "terraform/state"
}
}
EOF
```
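Note that the heredoc delimiter (`EOF`) above is unquoted, so the shell expands `${BUCKET_NAME}` when `backend.tf` is written; Terraform then sees a literal bucket name. A small sketch with a placeholder name:

```shell
BUCKET_NAME=my-demo-bucket   # placeholder

# Unquoted EOF: the shell substitutes ${BUCKET_NAME} at write time.
cat > /tmp/backend.tf << EOF
terraform {
  backend "gcs" {
    bucket = "${BUCKET_NAME}"
    prefix = "terraform/state"
  }
}
EOF

grep 'bucket' /tmp/backend.tf   # the variable has already been expanded
```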

Follow the [instructions](https://cloud.google.com/scheduler/docs/quickstart) to create an App Engine application, which is needed to
set up Cloud Scheduler jobs.

Note: Cloud Scheduler jobs need to be created in the same region as the App Engine application.

Afterwards, you can submit a Cloud Build job to create all the resources.
Replace *REGION* and *PROJECT_ID* with your own values.

```
cd ..
gcloud builds submit --config=cloudbuild.yaml \
--substitutions=_BUCKET=${BUCKET},_REGION=[REGION],_PROJECT_ID=[PROJECT_ID] .
```

The job will run on the schedule that you defined in the Terraform script.
You can also run the scheduler manually through the UI and watch it trigger your Dataflow batch job.
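To trigger the job from the command line instead of the UI, you can use `gcloud scheduler jobs run`. This is a sketch: the job name and region below are assumptions, so check `gcloud scheduler jobs list` for the actual name created by Terraform.

```shell
JOB_NAME=scheduler-demo-job   # assumption -- look it up with: gcloud scheduler jobs list
REGION=us-central1            # must match the region in the Terraform script

CMD="gcloud scheduler jobs run ${JOB_NAME} --location=${REGION}"
echo "$CMD"   # remove this echo indirection to actually run the command
```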
@@ -79,4 +129,4 @@ You can check the status of jobs through the UI.

## Cleaning up

Since this tutorial uses multiple GCP components, be sure to delete the associated resources when you're done.
@@ -9,24 +9,20 @@ steps:
- |
terraform init -input=false
terraform workspace select scheduler-dataflow-demo || terraform workspace new scheduler-dataflow-demo
terraform apply -input=false \
-auto-approve
terraform apply -input=false -var=project_id=${_PROJECT_ID} -var=region=${_REGION} \
-var=bucket=${_BUCKET} -auto-approve
waitFor: ['-']

- id: "Build dataflow template"
name: maven:3.6.0-jdk-11-slim
dir: 'dataflow'
env:
- "PROJECT=${_PROJECT_ID}"
- "BUCKET=${_BUCKET}"
entrypoint: 'bash'
args:
- '-c'
- |
mvn compile exec:java \
-Dexec.mainClass=DataflowDemoPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=zhong-gcp \
--stagingLocation=gs://zhong-gcp/staging \
--gcpTempLocation=gs://zhong-gcp/temp \
--region=us-west1 \
--templateLocation=gs://zhong-gcp/templates/dataflow-demo-template"
./build.sh
waitFor: ['Terraform init']

@@ -0,0 +1,10 @@
#!/bin/bash

mvn compile exec:java \
-Dexec.mainClass=DataflowDemoPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=${PROJECT} \
--stagingLocation=${BUCKET}/staging \
--gcpTempLocation=${BUCKET}/temp \
--region=us-central1 \
--templateLocation=${BUCKET}/templates/dataflow-demo-template"
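`build.sh` reads `PROJECT` and `BUCKET` from the environment, which Cloud Build sets via the `env` block in `cloudbuild.yaml`. To run it locally you would export the same variables first (a sketch with placeholder values):

```shell
# Placeholder values -- in CI, Cloud Build sets these from _PROJECT_ID and _BUCKET.
export PROJECT=my-project
export BUCKET=gs://my-demo-bucket

echo "would run: PROJECT=${PROJECT} BUCKET=${BUCKET} ./build.sh"
# Replace the echo with ./build.sh to actually compile and stage the template.
```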
@@ -1,13 +1,6 @@
terraform {
backend "gcs" {
bucket = "zhong-gcp"
prefix = "terraform/state"
}
}

provider "google" {
version = "~> 2.20"
project = "zhong-gcp"
project = var.project_id
}

# Use this data source to get project details. For more information see API.
@@ -19,11 +12,11 @@ resource "google_cloud_scheduler_job" "scheduler" {
schedule = "0 0 * * *"
# This needs to be us-central1 even if the App Engine region is us-central.
# You will get a resource-not-found error if you use just us-central.
region = "us-central1"
region = var.region

http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://${var.bucket}/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}
@@ -38,8 +31,8 @@ resource "google_cloud_scheduler_job" "scheduler" {
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://zhong-gcp/temp",
"zone": "us-west1-a"
"tempLocation": "gs://${var.bucket}/temp",
"zone": "${var.region}-a"
}
}
EOT
@@ -1,10 +1,13 @@
variable "project_id" {
type = string
default = "zhong-gcp"
}

variable "region" {
type = string
default = "us-west1"
default = "us-central1"
}

variable "bucket" {
type = string
}
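Since `bucket` has no default, Terraform requires a value for it; `project_id` and `region` can also be overridden. A sketch of supplying them on the command line (placeholder values):

```shell
# Run from the terraform/ directory; drop the echo indirection to apply for real.
CMD="terraform apply -var=project_id=my-project -var=region=us-central1 -var=bucket=my-demo-bucket"
echo "$CMD"
```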
