Setting up IPython Notebook with PySpark

These are brief notes on setting up an environment for running PySpark via IPython Notebook with Spark v1.4.1. The steps cover running standalone Spark on a single node. If you wish to run on Amazon clusters, you can use Spark's ec2 script for Amazon EC2, or create an EMR job directly and select Spark as an add-on via the AWS web console.

Requirements

Java 1.7 or greater
Maven or Simple build tool (sbt)
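
The Java requirement above can be checked from Python before building. This is a minimal sketch, not part of the original setup: the function names (parse_java_version, java_is_new_enough, read_java_banner) are made up for illustration, and it assumes the banner contains a quoted version string such as `version "1.7.0_79"`.

```python
import re
import subprocess


def parse_java_version(banner):
    """Return (major, minor) parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)


def java_is_new_enough(banner, minimum=(1, 7)):
    """True if the parsed version meets the minimum (Java 1.7 by default)."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum


def read_java_banner():
    """Run `java -version`; the JVM prints its banner on stderr."""
    proc = subprocess.Popen(["java", "-version"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return output.decode("utf-8", "replace")
```

Calling java_is_new_enough(read_java_banner()) on a machine with Java on the PATH gives a quick yes/no answer. The parsing is kept separate from the subprocess call so it can be tried on a pasted banner string.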

Install Spark

Download the latest Spark release (1.4.1 as of July 2015) from https://spark.apache.org/downloads.html. I selected spark-1.4.1.tar.

Extract the tar file:

tar -xvf spark-1.4.1.tar

Build Spark as described in its README.md. Alternatively, download a pre-built version and skip this step.

mvn clean package -DskipTests

Set the "SPARK_HOME" environment variable. For example, in Unix environments, add the following to ~/.bash_profile:

export SPARK_HOME=<location of the install>
export PATH=$SPARK_HOME/bin:$PATH
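
After reloading the shell, the variable can be sanity-checked from Python. The sketch below is an illustration only: check_spark_home is a hypothetical helper, and it assumes the launcher lives at bin/pyspark under SPARK_HOME, as in the verification step of this guide.

```python
import os


def check_spark_home(environ=None):
    """Return a list of problems found with SPARK_HOME (empty means OK)."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    if not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: " + spark_home)
    elif not os.path.isfile(os.path.join(spark_home, "bin", "pyspark")):
        problems.append("bin/pyspark not found under " + spark_home)
    return problems
```

Running check_spark_home() with no arguments inspects the current environment; an empty list means the directory and the pyspark launcher were both found.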

Verify that pyspark is installed correctly:

cd <spark-distro-directory>
./bin/pyspark

import math
testRdd = sc.parallelize([4,16,9])
testRdd.map(math.sqrt).collect()

# verify results: the last line should return [2.0, 4.0, 3.0]

# to quit
exit()

Install Python

Install Anaconda

Download Anaconda, which includes Python 2.7 and the main scientific libraries: http://ipython.org/install.html. After installing, update conda and the IPython packages:

conda update conda
conda update ipython ipython-notebook ipython-qtconsole

Anaconda comes with the free Spyder IDE. There are other free IDEs and text editors, such as Sublime and Emacs, as well as non-free ones such as PyCharm by JetBrains.

Run PySpark from IPython notebook

IPython Notebook is somewhat similar to Mathematica. It's a web app that lets you combine descriptions, images, and visualizations with executable code.

Download the following Python setup script from GitHub to create a new pyspark profile for running IPython Notebook: https://github.com/felixcheung/vagrant-projects/blob/master/Spark-IPython-Zeppelin-Lightning/ipython-pyspark.py. Then run it:

python ipython-pyspark.py
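
I have not reproduced the linked script here. As a rough idea of what a pyspark profile has to arrange, the sketch below shows one common approach: putting Spark's Python sources and its bundled Py4J zip on the import path. The helper name pyspark_sys_paths is hypothetical, and the python/ and python/lib/py4j-*-src.zip layout is an assumption about the Spark distribution rather than something taken from the script.

```python
import glob
import os


def pyspark_sys_paths(spark_home):
    """Return the directory and zip files that make `import pyspark` work."""
    python_dir = os.path.join(spark_home, "python")
    # Spark bundles Py4J as a source zip; the version in the file name varies
    # between releases, so match it with a wildcard instead of hard-coding it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips
```

A startup file could then prepend the result to sys.path, for example with sys.path[:0] = pyspark_sys_paths(os.environ["SPARK_HOME"]), before importing pyspark.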

Start IPython Notebook from a terminal:

ipython notebook

Open your browser and navigate to the following address to view the IPython notebooks:

localhost:1088

If you would like to use a different port, modify the port setting in the ipython-pyspark.py script:

ip = '*' # Warning: this is potentially insecure
port = <new-port>
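
Before settling on a new port, it can help to confirm that nothing else is using it. This is a small sketch under the assumption that the notebook will be reached on the local machine; port_is_free is a name invented for this example and is not part of the setup script.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if this process can bind host:port right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error:
        # Typically "address already in use", or a permission error
        # for privileged ports below 1024.
        return False
    finally:
        sock.close()
```

For example, port_is_free(8889) returning True suggests 8889 is a reasonable candidate; the check is only a snapshot, so another process could still take the port afterwards.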

Create a new notebook:

New -> Python (2 or 3)
Press + to create a new cell
Press the play icon to run (or Ctrl+Enter)

Test that PySpark is working from IPython Notebook by pasting the following into a cell and running it:

import math
testRdd = sc.parallelize([16,16,9])
testRdd.map(math.sqrt).collect()
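
To know what the cell above should return, the same square roots can be computed in plain Python without Spark. The names values and expected are chosen just for this illustration.

```python
import math

values = [16, 16, 9]
# Same transformation as testRdd.map(math.sqrt).collect(), computed locally.
expected = [math.sqrt(v) for v in values]
print(expected)  # [4.0, 4.0, 3.0]
```

If the notebook cell returns the same list, the SparkContext `sc` is available in the notebook and the job ran end to end.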
