Play around with Spark
from pyspark.sql import SparkSession

# <spark_host> is the master URL of the cluster (standalone mode)
spark = SparkSession \
    .builder \
    .master('<spark_host>') \
    .appName("meta_info") \
    .getOrCreate()
For Jupyter Notebook with Python 3, the following environment variables are required:
from os import environ
environ['PYSPARK_PYTHON']='/home/ubuntu/anaconda3/bin/python'
environ['PYSPARK_DRIVER_PYTHON']='/home/ubuntu/anaconda3/bin/jupyter'
spark_host can be found in the Spark cluster web UI.
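For a standalone cluster, the master URL shown at the top of the web UI has the form spark://<host>:<port>, with 7077 as the default port. Below is a minimal sketch of building that string before passing it to .master(). The helper name standalone_master_url and the example host are hypothetical and not part of the original notes.

```python
def standalone_master_url(host: str, port: int = 7077) -> str:
    """Build a standalone-mode master URL such as spark://10.0.0.1:7077."""
    host = host.strip()
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"spark://{host}:{port}"


# Hypothetical address of the master node; replace it with the value from the web UI.
spark_host = standalone_master_url("10.0.0.1")
print(spark_host)
```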
Download the spark-xml reader:
wget 'http://search.maven.org/remotecontent?filepath=com/databricks/spark-xml_2.10/0.4.1/spark-xml_2.10-0.4.1.jar' -O $SPARK_HOME/jars/spark-xml_2.10-0.4.1.jar
Then start pyspark with the package:
pyspark --packages com.databricks:spark-xml_2.10:0.4.1
Or, in a Jupyter notebook, set the submit arguments before creating the SparkSession:
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
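When several packages or jars are needed in a notebook, the PYSPARK_SUBMIT_ARGS string can be assembled instead of typed by hand. This is a sketch added to these notes, and build_submit_args is a hypothetical helper. It relies on --packages and --jars each accepting a comma-separated list, and on the value ending with pyspark-shell.

```python
from os import environ


def build_submit_args(packages=(), jars=()):
    """Return a PYSPARK_SUBMIT_ARGS value for Maven coordinates and local jar paths."""
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if jars:
        parts += ["--jars", ",".join(jars)]
    parts.append("pyspark-shell")  # the value has to end with pyspark-shell
    return " ".join(parts)


# Set this before the first SparkSession is created in the notebook.
environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["com.databricks:spark-xml_2.10:0.4.1"],
)
print(environ["PYSPARK_SUBMIT_ARGS"])
```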
Download the PostgreSQL JDBC driver:
wget https://jdbc.postgresql.org/download/postgresql-42.2.2.jar -O $SPARK_HOME/jars/postgresql-42.2.2.jar
(or /usr/local/spark/jars)
pyspark --jars /usr/local/spark/jars/postgresql-42.2.2.jar
???
jdbcDF = spark.read \
.format('jdbc') \
.option('url', 'jdbc:postgresql://<instance_name>.<user_name>.us-east-1.rds.amazonaws.com:5432/<dbname>') \
.option('dbtable', __credential__.table_name) \
.option('user', __credential__.user) \
.option('password', __credential__.password) \
.load()
(Ref: https://aws.amazon.com/getting-started/tutorials/create-connect-postgresql-db/)
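The connection settings above can also be collected in a dictionary and passed in one call with .options(**opts). This is a sketch under assumptions: postgres_jdbc_options is a hypothetical helper, the host and credentials below are placeholders, and org.postgresql.Driver is the driver class shipped in the PostgreSQL JDBC jar.

```python
def postgres_jdbc_options(host, dbname, table, user, password, port=5432):
    """Collect the options for spark.read.format('jdbc') in one dictionary."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{dbname}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


# Placeholder values only; substitute the real RDS endpoint and credentials.
opts = postgres_jdbc_options(
    host="example-host", dbname="mydb", table="public.my_table",
    user="me", password="secret",
)
print(opts["url"])

# With an active SparkSession named `spark`:
# jdbcDF = spark.read.format("jdbc").options(**opts).load()
```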
Instructions for the Redshift JDBC driver:
https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver
All of the following packages are required:
wget http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/3.0.0-preview1/spark-redshift_2.11-3.0.0-preview1.jar -O $SPARK_HOME/jars/spark-redshift_2.11-3.0.0-preview1.jar
wget http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar -O $SPARK_HOME/jars/spark-avro_2.11-4.0.0.jar
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.12.1017/RedshiftJDBC41-1.2.12.1017.jar -O $SPARK_HOME/jars/RedshiftJDBC41-1.2.12.1017.jar
wget https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar
Note: spark-redshift_2.11-3.0.0-preview1.jar is the only version of spark-redshift that does not cause an "S3 endpoint URI invalid" error or java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class.
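A NoClassDefFoundError on scala/collection/GenTraversableOnce$class is typically a sign that a jar was compiled for a different Scala binary version than the one Spark runs on. The _2.10 or _2.11 suffix in an artifact name is that version. The sketch below is an addition to these notes: it groups jar file names by that suffix so a mismatch stands out. The helper name scala_versions is hypothetical.

```python
import re
from collections import defaultdict

# Matches the Scala binary version suffix in names like spark-avro_2.11-4.0.0.jar
_SCALA_SUFFIX = re.compile(r"_(\d+\.\d+)-")


def scala_versions(jar_names):
    """Group jar names by Scala binary version suffix; jars without one are skipped."""
    groups = defaultdict(list)
    for name in jar_names:
        match = _SCALA_SUFFIX.search(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


jars = [
    "spark-xml_2.10-0.4.1.jar",
    "spark-redshift_2.11-3.0.0-preview1.jar",
    "spark-avro_2.11-4.0.0.jar",
    "postgresql-42.2.2.jar",
]
found = scala_versions(jars)
if len(found) > 1:
    print("Mixed Scala versions:", sorted(found))
```

In practice the list could come from os.listdir() on the $SPARK_HOME/jars directory. Pure Java jars such as the PostgreSQL and Redshift JDBC drivers carry no suffix and are ignored.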