H2O Open Source clusters on GCP

Terraform deployment template to spin up a 3-node H2O Open Source cluster in GCP. This template is based on this tutorial

This template is a work in progress and is provided without any warranty or support. You are free to refer to it or modify it as you need.

There are two distinct parts to this setup:

  1. Setting up the GCP project, service account, VPC, subnets, etc. This step also creates the GCP compute instance we call the Workspace. The Workspace instance sits in the public subnet and is the gateway for all communication between the H2O cluster and the data scientist's machine. The H2O cluster itself lives in the private subnet and is not directly accessible, so all H2O users need to ssh to the Workspace instance to create it.
  2. Creating the H2O cluster using the h2ocluster tool available on the Workspace instance.

Step 1: Main Infrastructure and Workspace Setup

These activities are performed by someone who has Cloud Admin privileges. In this step we complete the prerequisites in the GCP console using a browser, and then set up the gcloud SDK on the machine from which Terraform will create the entire infrastructure.

Create a GCP Project

  • Using a web browser, login to GCP Console
  • Create a new Project
  • Ensure Billing is enabled for the project
  • Enable needed APIs and services (link is on the top of project dashboard)
    • Compute Engine API
    • Identity and Access Management (IAM) API
    • Cloud Resource Manager API
    • Cloud Monitoring API
    • Cloud Logging API
    • OS Config API
  • !! NOTE !! - The project name used throughout this document is project48a.
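
The same APIs can also be enabled from the command line once the gcloud CLI is set up as described in the next section. This is a minimal sketch rather than part of the original template: the service names are assumed to be the standard identifiers for the six APIs listed above, and the helper name enable_required_apis is hypothetical.

```shell
# Service names assumed to correspond to the six APIs listed above.
REQUIRED_APIS="compute.googleapis.com
iam.googleapis.com
cloudresourcemanager.googleapis.com
monitoring.googleapis.com
logging.googleapis.com
osconfig.googleapis.com"

# enable_required_apis PROJECT_ID
# Hypothetical helper: enables every API in REQUIRED_APIS for the project.
enable_required_apis() {
  project="$1"
  # Word splitting is intentional here: one argument per service name.
  # shellcheck disable=SC2086
  gcloud services enable $REQUIRED_APIS --project "$project"
}

# Example (needs an authenticated gcloud and a billing-enabled project):
#   enable_required_apis project48a
```

Billing still has to be enabled on the project first, otherwise enabling the Compute Engine API is expected to fail.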

Setup gcloud cli

  • Preferably set up the gcloud SDK on a Linux-based machine, ideally one used by the cloud admin team to manage the cloud infrastructure.

  • Follow steps to install Google Cloud SDK

  • Create a new gcloud profile and authenticate

    $ gcloud config configurations create hemen-h2oai
    Created [hemen-h2oai].
    Activated [hemen-h2oai].
    $ gcloud auth login
    Your browser has been opened to visit:
    
        https://acco ..... deleted .... t_account
    
    You are now logged in as [hemen.kapadia@h2o.ai].
    Your current project is [None].  You can change this setting by running:
      $ gcloud config set project PROJECT_ID
    
  • Set the project. You should have already created the project from the GUI as discussed earlier. Ensure that billing and the required APIs are enabled for it.

    $ gcloud config set project project48a
    Updated property [core/project].
    
  • Setup compute region. You can use gcloud compute regions list to get a list of available compute regions

    $ gcloud config set compute/region us-west1
    Updated property [compute/region].
    
  • Setup compute zone. You can use gcloud compute zones list to get a list of available compute zones

    $ gcloud config set compute/zone us-west1-a
    Updated property [compute/zone].
    
  • Check all set configurations are as expected

    $ gcloud config list
    [compute]
    region = us-west1
    zone = us-west1-a
    [core]
    account = hemen.kapadia@h2o.ai
    disable_usage_reporting = True
    project = project48a
    
    Your active configuration is: [hemen-h2oai]

Setup Service Account

  • A total of 3 service accounts are needed for this to work end to end. Of the three, one is created manually and has the most privileges. The remaining two are created by the Terraform script

    • project48a-sa
      • This one is created manually as shown below using gcloud
      • It is used to setup the VPC, firewalls etc and also the Workspace instance
      • Needs Compute Admin to create instances and Storage Admin to manage state
    • workspaceinstate-sa
      • Terraform creates this
      • This is the SA assigned to the Workspace instance created in this step.
      • Terraform running on the Workspace then uses this SA, so its permissions control who can start H2O clusters.
    • h2ocluster-sa
      • Terraform creates this
      • This SA will be assigned to each VM instance that forms the H2O cluster nodes
      • Has access to Google Cloud Storage and BigQuery
  • Create a Service Account for this Project

    gcloud iam service-accounts create project48a-sa \
        --description="Project48a Service Account" \
        --display-name="project48a-sa"
        
    gcloud iam service-accounts list 
  • Ensure the service account has the necessary privileges. The roles below are broader than strictly required; more fine-grained roles could be granted instead

    gcloud projects add-iam-policy-binding project48a --member serviceAccount:project48a-sa@project48a.iam.gserviceaccount.com --role roles/storage.admin 
    gcloud projects add-iam-policy-binding project48a --member serviceAccount:project48a-sa@project48a.iam.gserviceaccount.com --role roles/compute.admin
    gcloud projects add-iam-policy-binding project48a --member serviceAccount:project48a-sa@project48a.iam.gserviceaccount.com --role roles/iam.serviceAccountAdmin
    gcloud projects add-iam-policy-binding project48a --member serviceAccount:project48a-sa@project48a.iam.gserviceaccount.com --role roles/iam.serviceAccountUser
    gcloud projects add-iam-policy-binding project48a --member serviceAccount:project48a-sa@project48a.iam.gserviceaccount.com --role roles/iam.securityAdmin
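
The five bindings above differ only in the role, so they can be expressed as a loop. This is a sketch under the assumption that the service account lives in the same project it is being granted roles on; the helper name grant_sa_roles is hypothetical.

```shell
# Roles granted to project48a-sa in the commands above.
SA_ROLES="roles/storage.admin
roles/compute.admin
roles/iam.serviceAccountAdmin
roles/iam.serviceAccountUser
roles/iam.securityAdmin"

# grant_sa_roles PROJECT_ID SA_NAME
# Hypothetical helper: binds each role in SA_ROLES to the service account
# SA_NAME@PROJECT_ID.iam.gserviceaccount.com and stops at the first failure.
grant_sa_roles() {
  project="$1"
  member="serviceAccount:$2@$1.iam.gserviceaccount.com"
  for role in $SA_ROLES; do
    gcloud projects add-iam-policy-binding "$project" \
      --member "$member" --role "$role" || return 1
  done
}

# Example: grant_sa_roles project48a project48a-sa
```

Trimming SA_ROLES is then the single place to tighten the privileges later.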
    
  • Create a service account key for use with Terraform. First create the directory structure shown in the Terraform project structure section. cat is used to check that the key file was created

    cd gcp/network
    gcloud iam service-accounts keys create gcpkey.json --iam-account project48a-sa@project48a.iam.gserviceaccount.com
    cat gcpkey.json
  • This service account key can now be used by Terraform by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS. Alternatively, the key file could be referenced directly in the Terraform code

    export GOOGLE_APPLICATION_CREDENTIALS=`pwd`/gcpkey.json

Create shared GCS storage for TF backend

  • The TF code currently assumes that the GCS bucket that stores the TF state already exists, so we create it up front
    • Ensure the value of the variable gcp_project_name in network/main/variables.tf matches the project name used in the commands below to create the TF state backend bucket
    • gsutil mb gs://project48a-tfstate to create the bucket
    • gsutil versioning set on gs://project48a-tfstate to enable versioning support
  • In a web browser, select the project, then from the top-left menu select Storage >> Browser and validate that the bucket was created.
  • An alternative is to have TF create the bucket, but that would require running terraform apply in multiple folders, as in https://github.com/tasdikrahman/terraform-gcp-examples
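
If the bucket step is likely to be re-run, the two gsutil commands above can be wrapped so that creation is skipped when the bucket already exists. This is only a sketch: the helper name ensure_tfstate_bucket is hypothetical, and it treats any failure of gsutil ls -b (including a permissions error) as "bucket missing".

```shell
# ensure_tfstate_bucket BUCKET_NAME
# Hypothetical helper: creates the bucket only if it cannot be listed,
# then turns on object versioning.
ensure_tfstate_bucket() {
  bucket="gs://$1"
  if ! gsutil ls -b "$bucket" >/dev/null 2>&1; then
    gsutil mb "$bucket" || return 1
  fi
  gsutil versioning set on "$bucket"
}

# Example: ensure_tfstate_bucket project48a-tfstate
```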

Next we configure Terraform to use the GCS backend for state management

  • Add this block inside the terraform { } block of the gcp/main/terraform.tf file

    backend "gcs" {
      bucket = "project48a-tfstate"
      prefix = "h2o/terraform"
    }

Terraform project structure

The directory structure now is

gcp
├── network                        	
├── h2ocluster                      
  • network directory
    • contains all Terraform code that will setup the VPC, Subnets, Firewall, Workspace instance, service accounts etc.
    • executed only one time
    • run from any external machine, possibly a cloud admin's laptop
  • h2ocluster directory
    • this directory is zipped and should be moved to /opt/h2ocluster on the Workspace system
    • contains Terraform code that will set up an N-node H2O cluster in the private subnet when requested by a user.
    • will be executed multiple times by the user to start/stop the cluster.
    • will not be executed directly as terraform apply or destroy. Instead, a bash wrapper is provided to list, create and destroy the cluster instance
    • list uses gcloud commands, whereas create and destroy use the Terraform code in this directory.

Terraform init and apply

  • Navigate to gcp/network directory and run terraform init to initialize terraform.
    • a terraform.tfstate file will be created in the network/.terraform directory with details about the GCS backend and modules
    • a TF state file without any resources is created in the GCS backend, named default.tfstate.
    • gsutil cat gs://project48a-tfstate/h2o/terraform/default.tfstate to view the content of this initial state
  • terraform apply can be used to create all the necessary network and workspace resources
  • terraform show can be used to see the resources state
  • terraform refresh can be used to update the state information with changes made to the real infrastructure via the Google web console.
  • At this point we trigger a terraform apply to create the VPC, public + private subnets, firewall rules, NAT gateways, service account etc. and the main Workspace machine on a GCP Compute instance.
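
The init and apply steps above can be chained in a small wrapper. This is a sketch, not the template's own tooling: the helper name apply_network is hypothetical, and saving the plan to a file named tfplan before applying it is an addition that the steps above do not describe.

```shell
# apply_network [DIR]
# Hypothetical wrapper: initialise the backend, save a plan, then apply
# exactly that saved plan. DIR defaults to gcp/network.
apply_network() {
  dir="${1:-gcp/network}"
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$dir" || exit 1
    terraform init -input=false || exit 1
    terraform plan -input=false -out=tfplan || exit 1
    terraform apply -input=false tfplan
  )
}

# Example (GOOGLE_APPLICATION_CREDENTIALS must already be exported):
#   apply_network gcp/network
```

Applying a saved plan means the changes that were reviewed are the ones that get created, and no interactive approval prompt is shown.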

Workspace machine

  • This is a single machine, similar to a bastion host, in the public subnet of the VPC. It should be up and running now.
  • Instances in the public subnet get an external IP and are therefore reachable from the internet.
  • Instances without a public address are private, and by convention we put them in the private subnet.
  • Once terraform apply has created the Workspace machine, it can be accessed with
    • gcloud beta compute ssh --zone "us-west1-a" --project "project48a" --ssh-key-file=~/.ssh/google_compute/id_rsa "h2o-instance-workspace"
  • Create SSH key - The very first time, you will not have an ssh key to use.
    • Assuming that you have completed the gcloud auth login step from point 3 above, you can run the above command without the --ssh-key-file option.
    • This creates the files google_compute_engine, google_compute_engine.pub and google_compute_known_hosts in the $HOME/.ssh directory.
    • This works only if, under Project >> IAM, your user has the Compute OS Login or Compute OS Admin Login role.
  • You should now be able to access this machine with the above command.
  • Additionally, once the above command has succeeded, you can use plain ssh as well. Note the username shown when you connect with the command above and use the same username with ssh.
    • ssh -i ~/.ssh/google_compute_engine hemen_kapadia_h2o_ai@35.247.123.203
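
The username in the ssh example above looks like the account email with "." and "@" replaced by "_". The sketch below reproduces that pattern; it is an assumption based on this one example, the helper name oslogin_username is hypothetical, and the username reported by the gcloud ssh session remains the authoritative value.

```shell
# oslogin_username EMAIL
# Replaces every "." and "@" in the account email with "_".
# Assumption: this mirrors the username pattern seen in the example above.
oslogin_username() {
  printf '%s\n' "$1" | tr '.@' '__'
}

# Example:
#   ssh -i ~/.ssh/google_compute_engine \
#     "$(oslogin_username hemen.kapadia@h2o.ai)@35.247.123.203"
```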

If the Workspace machine is created and you are able to ssh to it, we conclude step 1 of creating the infrastructure setup

Step 2: Creating H2O clusters

  1. On Workspace Servers

  • Once the Workspace server is up and running, check that jq --version, terraform --version and gcloud config list all work. Additionally, gcloud should detect the service account that is associated with the Workspace compute instance.
  • To avoid the zone prompt for some of the commands used internally by the h2ocluster tool set the zone information using gcloud config set compute/region us-west1
  • Update PATH variable export PATH="$PATH:/opt/h2ocluster/terraform"
  • Initialise h2ocluster --help
  • Read the usage of h2ocluster tool using h2ocluster --help
  • Create a cluster h2ocluster create
  • Once created note the IP and Port information displayed
    H2O Cluster Information:
    =========================
    Cluster Name: h2o-hemenkap-letxk7f-cluster
    Cluster Size: 3
    Cluster Leader IP and PORT: 10.100.1.2:54321
    Cluster Leader Url: http://10.100.1.2:54321/flow/index.html#
    
  • Using this info, create an ssh local port forward to the H2O cluster created in the private subnet via the Workspace machine (which is like a bastion). You can select any local port to forward. I used 8888 in this example.
    ssh -i ~/.ssh/google_compute_engine -L 8888:10.100.1.2:54321  hemen_kapadia_h2o_ai@35.247.123.203
  • Open a browser on your laptop and go to URL http://localhost:8888. You will see the H2O flow UI.
  • If you are running Python or R code to connect to the H2O cluster, the cluster address depends on where your code is executing.
    • From the Workspace machine, use http://10.100.1.2:54321
    • From a local machine/laptop with ssh forwarding, use http://localhost:8888
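
The port forward described above can also run in the background, which leaves the terminal free. This is a sketch with hypothetical helper names (forward_spec, open_h2o_tunnel); the key path is the one gcloud generated earlier, and the /3/Cloud check assumes the H2O-3 REST API is reachable on the leader port.

```shell
# forward_spec LOCAL_PORT LEADER_IP_AND_PORT
# Builds the argument for ssh -L, for example 8888:10.100.1.2:54321.
forward_spec() {
  printf '%s:%s\n' "$1" "$2"
}

# open_h2o_tunnel LOCAL_PORT LEADER_IP_AND_PORT USER@WORKSPACE_IP
# -N: forward ports only, no remote shell.
# -f: drop to the background once authentication has succeeded.
open_h2o_tunnel() {
  ssh -i ~/.ssh/google_compute_engine -f -N \
    -L "$(forward_spec "$1" "$2")" "$3"
}

# Example:
#   open_h2o_tunnel 8888 10.100.1.2:54321 hemen_kapadia_h2o_ai@35.247.123.203
#   curl -s http://localhost:8888/3/Cloud   # cluster status as JSON
```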

Step 3: Accessing data from Cloud Storage and BigQuery

Here we will upload some data into a GCS bucket and then import it into the H2O-3 cluster. We will also create a BQ table and import it into H2O-3.

Create data file in cloud storage and import it to H2O-3

  • Create a GCS bucket
    • gsutil mb -p project48a -c NEARLINE -l US-WEST1 -b on gs://h2ocluster-train-data
  • Upload some data files to the bucket
    • gsutil cp ~/Workspace/Office/Datasets/flights_delay/allyears2k.csv gs://h2ocluster-train-data/flights-delay/
    • gsutil cp ~/Workspace/Office/Datasets/flights_delay/airlines_all.05p.csv gs://h2ocluster-train-data/flights-delay/
  • To import this GCS file in H2O-3, create a new Flow Notebook, and in the cell enter the expression importFiles ['gs://h2ocluster-train-data/flights-delay/allyears2k.csv']. Run the cell by pressing Ctrl+Enter.
  • It will indicate that 1 file is imported, click the Parse these files.. button
  • A new section will show up, set necessary datatype changes for the columns and then click the Parse button at the bottom of the section.
  • A job will run which will generate a .hex file. This is the H2O-3 dataframe. Click the View button to get frame summary.
  • In the section that opens click View Data to see the imported data in H2O-3

Create table in BQ and import it to H2O-3

  • When the bq command is used for the first time, it initializes itself and asks for the default project. My project is named project48a, so I selected that.
  • To list datasets in this project bq ls project48a:, the last part defining the project can be removed if default project is set
  • Create airlines dataset bq --location=us-west1 mk project48a:airlines
  • Verify it got created bq ls --format=pretty
  • Adding a table to the dataset and loading data in it
    • bq --location=us-west1 load --autodetect --null_marker="NA" --source_format=CSV project48a:airlines.allyears2k ~/Workspace/Office/Datasets/flights_delay/allyears2k.csv
    • bq ls --format=pretty project48a:airlines
    • bq show --format=pretty project48a:airlines.allyears2k will describe the table
  • To import this data into H2O-3, create a new notebook and click Data >> Import SQL Table. In the section that opens up enter the below information to read data from the above table.
    • JDBC URL: jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project48a;OAuthType=3;Location=us-west1;LogLevel=4;LogPath=/tmp/h2o-bigquery-logs;
    • Table: project48a.airlines.allyears2k
    • Fetch mode can be Single or Distributed
    • Leave all other fields empty
  • Click the Import Button. It will open a new section. Click View button in this section.
  • A new section will open that shows the progress of the import job.
  • Once the job is 100% completed, click the View button to get the frame summary. Finally, click the View Data button to verify that the data was imported.

Optional Step: Create and use a custom H2O-3 image

  • To speed up the cluster creation times you can use an image with H2O preloaded on it.
  • Create a service account key for h2ocluster-vm-sa on your local machine and move it to the workspace machine under /opt/h2ocluster/packer/scripts
    gcloud iam service-accounts keys create h2ocluster-sa-key.json --iam-account h2ocluster-vm-sa@project48a.iam.gserviceaccount.com
  • To create such an image, on the Workspace machine follow the instructions below
    • cd /opt/h2ocluster/packer/
    • If needed update the variable values in the file h2o-gcp-image.json
    • packer build h2o-gcp-image.json
  • Once the image is built, it can be used in the terraform code to create H2O-3 clusters.
    • Edit the file /opt/h2ocluster/terraform/terraform.tfvars and update the value of h2o_cluster_instance_boot_disk_image to the name of the packer image.
  • Now the cluster load times will be significantly reduced as compared to the situation where we start from a bare RHEL7 image as the base
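
The terraform.tfvars edit above can be scripted. This is a sketch with a hypothetical helper name (set_boot_image); it assumes the variable already appears on a line of its own in the file and that the image name contains only letters, digits and hyphens.

```shell
# set_boot_image TFVARS_FILE IMAGE_NAME
# Rewrites the existing h2o_cluster_instance_boot_disk_image assignment.
# Assumption: IMAGE_NAME has no characters that are special to sed
# ("|", "&" or backslashes).
set_boot_image() {
  tfvars="$1"
  image="$2"
  tmp="$(mktemp)" || return 1
  sed "s|^h2o_cluster_instance_boot_disk_image[[:space:]]*=.*|h2o_cluster_instance_boot_disk_image = \"$image\"|" \
    "$tfvars" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back through cat so the original file keeps its permissions.
  cat "$tmp" > "$tfvars"
  rm -f "$tmp"
}

# Example, with a placeholder image name:
#   set_boot_image /opt/h2ocluster/terraform/terraform.tfvars my-h2o-image
```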
