Table of Contents (Dataproc cluster on Debian 9 with Zookeeper, Kafka, BigQuery and other tools, via Terraform)
- Pre-reqs
- Creation and destroying
- Cluster details
- Cloud Dataproc version
- URLs and extra components via dataproc-initialization-actions
- Terraform graph
- Automatic provisioning
- Testing Kafka
- Reporting bugs
- Patches and pull requests
-
Download and install the Google Cloud SDK
-
One may install the gcloud SDK silently for all users as root, with access to GCLOUD_HOME restricted to a specific user:
export USERNAME="<<your_user_name>>"
export SHARE_DATA=/data
su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USERNAME
echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh
echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh
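A more defensive variant of the profile snippet above can guard against machines where the SDK is absent. This is a sketch, not part of the repo; the function name is hypothetical, and SHARE_DATA defaults to /data as in the commands above:

```shell
# Hypothetical guard for login shells: only source the SDK helper files if
# they are actually readable, so shells on hosts without the SDK don't error.
source_gcloud_profile() {
  local dir="${SHARE_DATA:-/data}/google-cloud-sdk"
  local f
  for f in path.bash.inc completion.bash.inc; do
    [ -r "$dir/$f" ] && . "$dir/$f"
  done
  return 0   # missing files are not treated as an error
}
```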
-
Clone this repository and cd into the dataproc-terraform folder:
git clone https://github.com/cloudgear-io/dataproc-terraform && cd dataproc-terraform
-
Please create a Service Account credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder.
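Before running terraform, it can help to sanity-check the credential file. A sketch (the function name is hypothetical; python3 is assumed available and is used only to validate the JSON syntax):

```shell
# Hypothetical pre-flight check: the credential file must exist, be readable,
# and contain valid JSON before terraform can authenticate with it.
check_credentials() {
  local f="${1:-credentials/google.json}"
  [ -r "$f" ] || { echo "missing $f" >&2; return 1; }
  python3 -m json.tool "$f" >/dev/null 2>&1 || { echo "$f is not valid JSON" >&2; return 1; }
}
```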
Creation and destroying
-
terraform init
terraform plan -out "run.plan"
terraform apply "run.plan"
terraform destroy
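The init/plan/apply cycle above can be wrapped in a small script. This is a sketch, not part of the repo; it falls back to printing the commands when terraform is not on PATH or no repo config (firewall.tf) is present, and DRY_RUN=1 forces that behaviour:

```shell
#!/usr/bin/env bash
# Hypothetical deploy wrapper for the plan/apply cycle shown above.
set -u

# Fall back to a dry run unless terraform and the repo config are available.
command -v terraform >/dev/null 2>&1 && [ -e firewall.tf ] || DRY_RUN=1

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"          # dry run: show the command instead of executing it
  else
    "$@"
  fi
}

run terraform init
run terraform plan -out run.plan
run terraform apply run.plan
```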
Please note: firewall.tf opens all ports. Tighten or switch off those rules before creation if you want.
Cluster details
-
Name | Role | Staging Bucket
---|---|---
poccluster-m, poccluster-w-* | 3 masters by default (auto HA with Zookeeper); the number of workers is prompted | dataproc-poc-staging-bucket
Cloud Dataproc version
-
Version | Includes | Base OS | Released On | Last Updated (sub-minor version) | Notes
---|---|---|---|---|---
1.3-deb9 | Apache Spark 2.3.1, Apache Hadoop 2.9.0, Apache Pig 0.17.0, Apache Hive 2.3.2, Apache Tez 0.9.0*, Cloud Storage connector 1.9.8-hadoop2 | Debian 9 | 2018/08/16 | 2018/10/26 (1.3.14-deb9) | All releases on and after November 2, 2018 will be based on Debian 9.
-
URLs and extra components via dataproc-initialization-actions
-
- YARN ResourceManager: http://<<Master_External_IP>>:8088/cluster
- HDFS NameNode: http://<<Master_External_IP>>:9870
- Hadoop Job History Server: http://<<Master_External_IP>>:19888/jobhistory
- Node Managers: http://<<Individual_Node_External_IP>>:8042
- Ganglia: http://<<Master_External_IP>>:80/ganglia
- Livy: http://<<Master_External_IP>>:8998
-
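As a convenience, the master-hosted UIs above can be printed for a given external IP. A sketch (the function name is hypothetical; node-manager UIs live on each worker node, so they are not listed here):

```shell
# Hypothetical helper: print the master web UI URLs for a given external IP.
master_urls() {
  local ip="$1"
  printf 'http://%s:%s\n' \
    "$ip" 8088/cluster \
    "$ip" 9870 \
    "$ip" 19888/jobhistory \
    "$ip" 80/ganglia \
    "$ip" 8998
}
```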
The latest Docker is installed on all nodes.
Terraform graph
-
Generate Graphviz (dot format) graphs of the Terraform configuration for a visual representation of the repo:
terraform graph | dot -Tsvg > graph.svg
One can also run Blast Radius against a live, initialized Terraform project to view the graph. A dockerized one-liner:
docker ps -a|grep blast-radius|awk '{print $1}'|xargs docker kill && rm -rf dataproc-terraform && git clone https://github.com/cloudgear-io/dataproc-terraform && cd dataproc-terraform/ && terraform init && docker run --cap-add=SYS_ADMIN -dit --rm -p 5006:5000 -v $(pwd):/workdir:ro 28mm/blast-radius && cd ../../
A live example for this project: https://github.com/cloudgear-io/dataproc-terraform/
Automatic provisioning
-
Pre-req: gcloud must be installed. Silent install:
export USERNAME="<<your_user_name>>" && export SHARE_DATA=/data && su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USERNAME && echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh && echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh
-
Please create a Service Account credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder of dataproc-terraform.
Plan:
terraform init && terraform plan -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>> -out "run.plan"
Apply:
terraform apply "run.plan"
Destroy:
terraform destroy -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>>
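The worker count can be validated before invoking terraform. A sketch (the function name is hypothetical; the variable names match those used in the plan/destroy commands above):

```shell
# Hypothetical helper: validate the worker count and compose the plan command
# from the repo's variables (cluster_location, project, worker_num_instances).
plan_cmd() {
  local project="$1" workers="$2"
  case "$workers" in
    ''|0|*[!0-9]*) echo "worker_num_instances must be a positive integer" >&2; return 1 ;;
  esac
  echo "terraform plan -var cluster_location=europe-west2 -var project=$project -var worker_num_instances=$workers -out run.plan"
}
```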
Testing Kafka
-
Once the cluster has been created, Kafka should be running on all worker nodes, and the Kafka libraries should be installed on the master node(s). You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools after SSHing into one of your master nodes:
gcloud compute ssh poccluster-m --zone europe-west2-b
Create a test topic, talking only to the local master's Zookeeper server:
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic test
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list
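The topic creation above can be made idempotent for re-runs. A sketch (the function name is hypothetical; the list and create commands are passed in as parameters, so the helper can be exercised without a live cluster — on the master you would pass the kafka-topics.sh invocations shown above):

```shell
# Hypothetical idempotent wrapper: create the topic only if it is not already
# in the list. $2 lists topics (one per line); $3 creates the topic.
ensure_topic() {
  local topic="$1" list_cmd="$2" create_cmd="$3"
  if $list_cmd | grep -qx "$topic"; then
    echo "topic $topic already exists"
  else
    $create_cmd
  fi
}
```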
Use worker 0 as the broker to publish 10 messages over 10 seconds, asynchronously:
export CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
for i in {1..10}; do echo "message${i}"; sleep 1; done | /usr/lib/kafka/bin/kafka-console-producer.sh --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test &
Use worker 1 as the broker to consume those 10 messages as they arrive. This can also be run on any other master or worker node of the cluster:
/usr/lib/kafka/bin/kafka-console-consumer.sh --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test --from-beginning
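To check that all 10 messages arrived, the consumer output can be piped into a small counter. A sketch (the function name is hypothetical; on the cluster you would pipe the consumer command above into it — here it is exercised with a local stand-in stream, since the check itself needs no Kafka):

```shell
# Hypothetical check: count lines that look like the test messages produced
# above ("message1" .. "message10"). Reads from stdin.
count_messages() { grep -c '^message'; }
```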
Reporting bugs
-
Please report bugs by opening an issue in the GitHub issue tracker. An issue template is defined for bugs; please follow it.
Patches and pull requests
-
Patches can be submitted as GitHub pull requests. Please make sure your branch applies to the current master as a fast-forward merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.