Table of Contents (Dataproc cluster on Debian 9 with Zookeeper, Kafka, BigQuery and other tools, via Terraform)
- Pre-reqs
- Creation and destroying
- Cluster details
- Cloud Dataproc version
- URLs and extra components via dataproc-initialization-actions
- Terraform graph
- Automatic provisioning
- Testing Kafka
- Reporting bugs
- Patches and pull requests
-
Download and install the Google Cloud SDK
-
One may install the gcloud SDK silently for all users as root, with access to GCLOUD_HOME restricted to a specific user:
export USERNAME="<<your_user_name>>"
export SHARE_DATA=/data
su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USERNAME
echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh
echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh
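A more defensive variant of the profile snippet above can guard against machines where the SDK is absent. This is a sketch, not part of the repo; the function name is hypothetical, and SHARE_DATA defaults to /data as in the commands above:

```shell
# Hypothetical guard for login shells: only source the SDK helper files if
# they are actually readable, so shells on hosts without the SDK don't error.
source_gcloud_profile() {
  local dir="${SHARE_DATA:-/data}/google-cloud-sdk"
  local f
  for f in path.bash.inc completion.bash.inc; do
    [ -r "$dir/$f" ] && . "$dir/$f"
  done
  return 0   # missing files are not treated as an error
}
```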
-
Clone this repository and cd into the dataproc-terraform folder:
git clone https://github.com/cloudgear-io/dataproc-terraform && cd dataproc-terraform
-
Please create a Service Account credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder.
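Before running terraform, it can help to sanity-check the credential file. A sketch (the function name is hypothetical; python3 is assumed available and is used only to validate the JSON syntax):

```shell
# Hypothetical pre-flight check: the credential file must exist, be readable,
# and contain valid JSON before terraform can authenticate with it.
check_credentials() {
  local f="${1:-credentials/google.json}"
  [ -r "$f" ] || { echo "missing $f" >&2; return 1; }
  python3 -m json.tool "$f" >/dev/null 2>&1 || { echo "$f is not valid JSON" >&2; return 1; }
}
```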
Creation and destroying
-
terraform init
terraform plan -out "run.plan"
terraform apply "run.plan"
terraform destroy
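The init/plan/apply cycle above can be wrapped in a small script. This is a sketch, not part of the repo; it falls back to printing the commands when terraform is not on PATH or no repo config (firewall.tf) is present, and DRY_RUN=1 forces that behaviour:

```shell
#!/usr/bin/env bash
# Hypothetical deploy wrapper for the plan/apply cycle shown above.
set -u

# Fall back to a dry run unless terraform and the repo config are available.
command -v terraform >/dev/null 2>&1 && [ -e firewall.tf ] || DRY_RUN=1

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"          # dry run: show the command instead of executing it
  else
    "$@"
  fi
}

run terraform init
run terraform plan -out run.plan
run terraform apply run.plan
```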
Please note: firewall.tf opens all ports. Tighten or switch off those rules before creation if you want.
Cluster details
-
Name | Role | Staging Bucket
---|---|---
poccluster-m, poccluster-w-* | 3 masters by default (auto HA with Zookeeper); the number of workers is prompted | dataproc-poc-staging-bucket
Cloud Dataproc version
-
Version | Includes | Base OS | Released On | Last Updated (sub-minor version) | Notes
---|---|---|---|---|---
1.3-deb9 | Apache Spark 2.3.1, Apache Hadoop 2.9.0, Apache Pig 0.17.0, Apache Hive 2.3.2, Apache Tez 0.9.0*, Cloud Storage connector 1.9.8-hadoop2 | Debian 9 | 2018/08/16 | 2018/10/26 (1.3.14-deb9) | All releases on and after November 2, 2018 will be based on Debian 9.
-
URLs and extra components via dataproc-initialization-actions
-
- YARN ResourceManager: http://<<Master_External_IP>>:8088/cluster
- HDFS NameNode: http://<<Master_External_IP>>:9870
- Hadoop Job History Server: http://<<Master_External_IP>>:19888/jobhistory
- Node Managers: http://<<Individual_Node_External_IP>>:8042
- Ganglia: http://<<Master_External_IP>>:80/ganglia
- Livy: http://<<Master_External_IP>>:8998
-
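As a convenience, the master-hosted UIs above can be printed for a given external IP. A sketch (the function name is hypothetical; node-manager UIs live on each worker node, so they are not listed here):

```shell
# Hypothetical helper: print the master web UI URLs for a given external IP.
master_urls() {
  local ip="$1"
  printf 'http://%s:%s\n' \
    "$ip" 8088/cluster \
    "$ip" 9870 \
    "$ip" 19888/jobhistory \
    "$ip" 80/ganglia \
    "$ip" 8998
}
```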
The latest Docker is installed on all nodes.
Terraform graph
-
Generate Graphviz (dot format) graphs of the Terraform configuration for a visual representation of the repo:
terraform graph | dot -Tsvg > graph.svg
One can also run Blast Radius against a live, initialized Terraform project to view the graph. A dockerized one-liner:
docker ps -a|grep blast-radius|awk '{print $1}'|xargs docker kill && rm -rf dataproc-terraform && git clone https://github.com/cloudgear-io/dataproc-terraform && cd dataproc-terraform/ && terraform init && docker run --cap-add=SYS_ADMIN -dit --rm -p 5006:5000 -v $(pwd):/workdir:ro 28mm/blast-radius && cd ../../
A live example for this project: https://github.com/cloudgear-io/dataproc-terraform/
Automatic provisioning
-
Pre-req: gcloud must be installed. Silent install:
export USERNAME="<<your_user_name>>" && export SHARE_DATA=/data && su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USERNAME && echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh && echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh
-
Please create a Service Account credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder of dataproc-terraform.
Plan:
terraform init && terraform plan -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>> -out "run.plan"
Apply:
terraform apply "run.plan"
Destroy:
terraform destroy -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>>
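The worker count can be validated before invoking terraform. A sketch (the function name is hypothetical; the variable names match those used in the plan/destroy commands above):

```shell
# Hypothetical helper: validate the worker count and compose the plan command
# from the repo's variables (cluster_location, project, worker_num_instances).
plan_cmd() {
  local project="$1" workers="$2"
  case "$workers" in
    ''|0|*[!0-9]*) echo "worker_num_instances must be a positive integer" >&2; return 1 ;;
  esac
  echo "terraform plan -var cluster_location=europe-west2 -var project=$project -var worker_num_instances=$workers -out run.plan"
}
```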
Testing Kafka
-
Once the cluster has been created, Kafka should be running on all worker nodes, and the Kafka libraries should be installed on the master node(s). You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools after SSHing into one of your master nodes:
gcloud compute ssh poccluster-m --zone europe-west2-b
Create a test topic, talking only to the local master's Zookeeper server:
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic test
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list
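The topic creation above can be made idempotent for re-runs. A sketch (the function name is hypothetical; the list and create commands are passed in as parameters, so the helper can be exercised without a live cluster — on the master you would pass the kafka-topics.sh invocations shown above):

```shell
# Hypothetical idempotent wrapper: create the topic only if it is not already
# in the list. $2 lists topics (one per line); $3 creates the topic.
ensure_topic() {
  local topic="$1" list_cmd="$2" create_cmd="$3"
  if $list_cmd | grep -qx "$topic"; then
    echo "topic $topic already exists"
  else
    $create_cmd
  fi
}
```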
Use worker 0 as the broker to publish 10 messages over 10 seconds, asynchronously:
export CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
for i in {1..10}; do echo "message${i}"; sleep 1; done | /usr/lib/kafka/bin/kafka-console-producer.sh --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test &
Use worker 1 as the broker to consume those 10 messages as they arrive. This can also be run on any other master or worker node of the cluster:
/usr/lib/kafka/bin/kafka-console-consumer.sh --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test --from-beginning
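To check that all 10 messages arrived, the consumer output can be piped into a small counter. A sketch (the function name is hypothetical; on the cluster you would pipe the consumer command above into it — here it is exercised with a local stand-in stream, since the check itself needs no Kafka):

```shell
# Hypothetical check: count lines that look like the test messages produced
# above ("message1" .. "message10"). Reads from stdin.
count_messages() { grep -c '^message'; }
```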
Reporting bugs
-
Please report bugs by opening an issue in the GitHub issue tracker. An issue template is defined for bugs; please follow it.
Patches and pull requests
-
Patches can be submitted as GitHub pull requests. Please make sure your branch applies to the current master as a fast-forward merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.