
Workshop Apache Beam and Google DataFlow


📋 Table of Contents

HouseKeeping

LAB

WalkThrough


HouseKeeping

Objectives

  • Introduce AstraDB and its Vector Search capability
  • Give you a first understanding of Apache Beam and Google DataFlow
  • Discover distributed NoSQL databases, especially Apache Cassandra™
  • Get familiar with a few Google Cloud Platform services

Frequently asked questions

1️⃣ Can I run this workshop on my computer?

There is nothing preventing you from running the workshop on your own machine. If you do so, you will need the following:

  1. git installed on your local system
  2. Java installed on your local system
  3. Maven installed on your local system

In this README, we try to provide instructions for local development as well, but keep in mind that the main focus is development on Gitpod; to keep on track with the schedule, we can't guarantee live support for local development. However, we will do our best to give you the info you need to succeed.

2️⃣ What other prerequisites are required?
  • You will need enough *real estate* on screen; we will ask you to open a few windows, and the workshop does not fit on mobile phones (tablets should be OK)
  • You will need a GitHub account, and possibly a Google account for Google authentication (optional)
  • You will need an Astra account: don't worry, we'll work through that below
  • As this is an intermediate-level workshop, we expect you to know what Java and Maven are

3️⃣ Do I need to pay for anything for this workshop?
No. All tools and services we provide here are FREE, not only during the session but also afterwards.

4️⃣ Will I get a certificate if I attend this workshop?
Attending the session is not enough. You need to complete the homework detailed below, and you will get a nice badge that you can share on LinkedIn or anywhere else *(Open Badge)*.

Materials for the Session

It doesn't matter if you join our workshop live or you prefer to work at your own pace, we have you covered. In this repository, you'll find everything you need for this workshop:


LAB

1 - Create your DataStax Astra account

ℹ️ Account creation tutorial is available in Awesome Astra

Click the image below or go to https://astra.datastax.com


2 - Create an Astra Token

ℹ️ Token creation tutorial is available in Awesome Astra

  • Locate `Settings` (#1) in the menu on the left, then `Token Management` (#2)

  • Select the role Organization Administrator before clicking [Generate Token]

The Token is in fact three separate strings: a Client ID, a Client Secret and the token proper. You will need some of these strings to access the database, depending on the type of access you plan. Although the Client ID, strictly speaking, is not a secret, you should regard this whole object as a secret and make sure not to share it inadvertently (e.g. committing it to a Git repository) as it grants access to your databases.

{
  "ClientId": "ROkiiDZdvPOvHRSgoZtyAapp",
  "ClientSecret": "fakedfaked",
  "Token": "AstraCS:fake"
}
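If you saved the token JSON to a file, you can pull out just the token string instead of copying it by hand. A minimal sketch, assuming the JSON above was saved as `token.json` (a hypothetical filename):

```shell
# Recreate the (fake) token JSON from above, for illustration only
cat > token.json <<'EOF'
{
  "ClientId": "ROkiiDZdvPOvHRSgoZtyAapp",
  "ClientSecret": "fakedfaked",
  "Token": "AstraCS:fake"
}
EOF

# Extract the "Token" field with Python's standard-library JSON parser
TOKEN=$(python3 -c 'import json; print(json.load(open("token.json"))["Token"])')
echo "${TOKEN}"
```

Treat the extracted value like a password: it is the string you will pass to `astra login` below.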

3 - Copy the token value to your clipboard

You can also leave the window open to copy the value in a moment.

4 - Open Gitpod

↗️ Right-click and select "Open in a new tab"...

Open in Gitpod

5 - Set up the CLI with your token

In gitpod, in a terminal window:

  • Login
astra login --token AstraCS:fake
  • Validate that you are set up
astra org

Output

gitpod /workspace/workshop-beam (main) $ astra org
+----------------+-----------------------------------------+
| Attribute      | Value                                   |
+----------------+-----------------------------------------+
| Name           | cedrick.lunven@datastax.com             |
| id             | f9460f14-9879-4ebe-83f2-48d3f3dce13c    |
+----------------+-----------------------------------------+

6 - Create destination Database and a keyspace

ℹ️ Notice that we enable the Vector Search capability

  • Create db workshop_beam and wait for the DB to become active
astra db create workshop_beam -k beam --vector --if-not-exists

💻 Output

  • List databases
astra db list

💻 Output

  • Describe your db
astra db describe workshop_beam

💻 Output

7 - Create Destination table

  • Create Table:
astra db cqlsh workshop_beam -k beam \
  -e  "CREATE TABLE IF NOT EXISTS fable(document_id TEXT PRIMARY KEY, title TEXT, document TEXT)"
  • Show Table:
astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

8 - Setup env variables

  • Create .env file with variables
astra db create-dotenv workshop_beam 
  • Display the file
cat .env
  • Load env variables
set -a
source .env
set +a
env | grep ASTRA
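The `set -a` / `set +a` pair marks every variable assigned in between for export, so sourcing `.env` makes its values visible to child processes such as `mvn`. A minimal illustration with a throwaway file (not your real `.env`; the variable name is hypothetical):

```shell
# Throwaway env file with a hypothetical variable
echo 'DEMO_ASTRA_VAR=hello' > /tmp/demo.env

set -a                 # auto-export every variable assigned from here on
source /tmp/demo.env
set +a                 # stop auto-exporting

# The variable is now visible to child processes
sh -c 'echo "$DEMO_ASTRA_VAR"'   # prints: hello
```

Without `set -a`, `source /tmp/demo.env` would set the variable only in the current shell, and the child `sh` would print an empty line.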

9 - Setup project

This command validates that Java, Maven and Lombok are working as expected:

mvn clean compile

10 - Run Importing flow

  • Open the CSV. It is kept very short and simple for demo purposes (and to keep OpenAI API costs down later :) ).
/workspace/workshop-beam/samples-beam/src/main/resources/fables_of_fontaine.csv
  • Open the Java file with the code
gp open /workspace/workshop-beam/samples-beam/src/main/java/com/datastax/astra/beam/genai/GenAI_01_ImportData.java

  • Run the Flow
cd samples-beam
mvn clean compile exec:java \
 -Dexec.mainClass=com.datastax.astra.beam.genai.GenAI_01_ImportData \
 -Dexec.args="\
 --astraToken=${ASTRA_DB_APPLICATION_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_DB_SECURE_BUNDLE_PATH} \
 --astraKeyspace=${ASTRA_DB_KEYSPACE} \
 --csvInput=`pwd`/src/main/resources/fables_of_fontaine.csv"
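Before launching the flow, it can save a failed run to check that the variables the command expects are actually set. A small sketch (the `check_vars` helper is ours, not part of the project; the variable names are taken from the command above):

```shell
# Print every required variable that is missing; return non-zero if any is
check_vars() {
  local rc=0
  for v in "$@"; do
    if [ -z "${!v}" ]; then
      echo "Missing: $v"
      rc=1
    fi
  done
  return $rc
}

# Demo with one hypothetical value; with your .env loaded, check all three:
#   check_vars ASTRA_DB_APPLICATION_TOKEN ASTRA_DB_SECURE_BUNDLE_PATH ASTRA_DB_KEYSPACE
export ASTRA_DB_KEYSPACE=beam
check_vars ASTRA_DB_KEYSPACE && echo "ASTRA_DB_KEYSPACE is set"
```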

11 - Validate Data

astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

WalkThrough

We will now compute the embeddings leveraging OpenAI. It is not free: you need to provide your credit card to access the API. This part is a walkthrough; if you have an OpenAI key, follow along!

1 Run Flow Compute

  • Setup Open AI
export OPENAI_API_KEY="<your_api_key>"
  • Open the Java file with the code
gp open /workspace/workshop-beam/samples-beam/src/main/java/com/datastax/astra/beam/genai/GenAI_02_CreateEmbeddings.java
  • Run the flow
mvn clean compile exec:java \
 -Dexec.mainClass=com.datastax.astra.beam.genai.GenAI_02_CreateEmbeddings \
 -Dexec.args="\
 --astraToken=${ASTRA_DB_APPLICATION_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_DB_SECURE_BUNDLE_PATH} \
 --astraKeyspace=${ASTRA_DB_KEYSPACE} \
 --openAiKey=${OPENAI_API_KEY} \
 --table=fable"

2 Validate Output

astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

3 Create Google Project

  • Create GCP Project

Note: if you don't plan to keep the resources you create in this guide, create a new project instead of selecting an existing one; when you finish these steps, you can delete the project, removing all resources associated with it.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

4 Enable Billing

Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project

5 Save project ID:

The project identifier is available in the ID column. We will need it, so let's save it as an environment variable:

export GCP_PROJECT_ID=integrations-379317
export GCP_PROJECT_CODE=747469159044
export GCP_USER=cedrick.lunven@datastax.com
export GCP_COMPUTE_ENGINE=747469159044-compute@developer.gserviceaccount.com

6 Download and install the gcloud CLI

curl https://sdk.cloud.google.com | bash

Do not forget to open a new terminal tab afterwards, so that the freshly installed gcloud is on your PATH.

7 Authenticate with Google Cloud

Run the following command to authenticate with Google Cloud:

  • Execute:
gcloud auth login
  • Authenticate with your Google account

8 Set your project

If you haven't set your project yet, use the following command to set your project ID:

gcloud config set project ${GCP_PROJECT_ID}
gcloud projects describe ${GCP_PROJECT_ID}

9 Enable needed API

gcloud services enable dataflow compute_component \
   logging storage_component storage_api \
   bigquery pubsub datastore.googleapis.com \
   cloudresourcemanager.googleapis.com

10 Add roles to Dataflow users

To complete the steps, your user account must have the Dataflow Admin role and the Service Account User role, and the Compute Engine default service account must have the Dataflow Worker role. To add the required roles from the command line:

gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} \
    --member="user:${GCP_USER}" \
    --role=roles/iam.serviceAccountUser
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/dataflow.admin
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/dataflow.worker
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/storage.objectAdmin

11 Create secrets

To connect to AstraDB you need a token (credentials) and a secure connect bundle (a zip archive that secures the transport). These two inputs should be defined as secrets:

```
gcloud secrets create astra-token \
   --data-file <(echo -n "${ASTRA_DB_APPLICATION_TOKEN}") \
   --replication-policy="automatic"

gcloud secrets create cedrick-demo-scb \
   --data-file ${ASTRA_DB_SECURE_BUNDLE_PATH} \
   --replication-policy="automatic"

gcloud secrets add-iam-policy-binding cedrick-demo-scb \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role='roles/secretmanager.secretAccessor'

gcloud secrets add-iam-policy-binding astra-token \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role='roles/secretmanager.secretAccessor'
    
gcloud secrets list
```

12 Make sure you are in the samples-dataflow folder

cd samples-dataflow
pwd

13 ✅ Make sure you have these variables initialized

We assume the table fable exists and has been populated in the LAB above.

export ASTRA_SECRET_TOKEN=projects/747469159044/secrets/astra-token/versions/2
export ASTRA_SECRET_SECURE_BUNDLE=projects/747469159044/secrets/cedrick-demo-scb/versions/1
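These values follow Secret Manager's resource-name pattern `projects/<project-number>/secrets/<secret-name>/versions/<version>`. Rather than copy-pasting, you can assemble them from the pieces you already saved (the project number and secret names below are the example values from earlier steps):

```shell
GCP_PROJECT_CODE=747469159044   # project number saved in step 5 (example value)

# Build a Secret Manager resource name from a secret name and a version
secret_version() {
  echo "projects/${GCP_PROJECT_CODE}/secrets/$1/versions/$2"
}

export ASTRA_SECRET_TOKEN="$(secret_version astra-token 2)"
export ASTRA_SECRET_SECURE_BUNDLE="$(secret_version cedrick-demo-scb 1)"
echo "${ASTRA_SECRET_TOKEN}"
```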

14 - ✅ Run the pipeline

mvn compile exec:java \
 -Dexec.mainClass=com.datastax.astra.dataflow.AstraDb_To_BigQuery_Dynamic \
 -Dexec.args="\
 --astraToken=${ASTRA_SECRET_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_SECRET_SECURE_BUNDLE} \
 --keyspace=${ASTRA_DB_KEYSPACE} \
 --table=fable \
 --runner=DataflowRunner \
 --project=${GCP_PROJECT_ID} \
 --region=us-central1"

15 - ✅ Show the Content of the Table

A dataset with the keyspace name and a table with the table name have been created in BigQuery.

bq head -n 10 ${ASTRA_DB_KEYSPACE}.fable

The END
