
Python4fun

A Spark & Pandas batch processing demo; data is loaded from the local filesystem, remote S3, and HDFS.
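The pandas side of the demo boils down to reading tabular data and running a batch aggregation over it. A minimal sketch of that pattern (the column names and values below are made up for illustration; they are not taken from the repo's scripts):

```python
import io
import pandas as pd

# Stand-in for a local CSV file. The demo scripts read real files from
# local, s3:// or hdfs:// paths instead of an in-memory buffer.
csv_data = io.StringIO(
    "city,sales\n"
    "NYC,100\n"
    "NYC,250\n"
    "SF,300\n"
)

df = pd.read_csv(csv_data)

# A simple batch aggregation: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

The same read-then-aggregate shape carries over to the Spark path, where the DataFrame is distributed across executors instead of held in local memory.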

Quickstart

(1) Run locally

python spark-pandas-hdfs-s3.py

(2) Submit the job to your Spark standalone cluster, if you already have one :)

$SPARK_HOME/bin/spark-submit --master spark://node1:7077 --deploy-mode cluster --executor-memory 1g spark-pandas-hdfs-s3.py

(3) Submit the job to YARN on top of your HDFS cluster, if you already have one :)

$spark-submit --master yarn --deploy-mode cluster --executor-memory 1g --packages com.databricks:spark-csv_2.10:1.5.0 spark-CDH5-1.6-hdfs-yarn.py

The log output you can expect when testing locally:

(screenshots of sample log output)
