Spark + pyspark setup guide

This is a guide for installing and configuring an instance of Apache Spark and its Python API, pyspark, on a single machine running Ubuntu 15.04.

-- Kristian Holsheimer, July 2015


Table of Contents

  1. Install Requirements

    1.1 Install Java

    1.2 Install Scala

    1.3 Install git

    1.4 Install py4j

  2. Set Up Apache Spark

    2.1 Download and extract source tarball

    2.2 Compile source

    2.3 Install files

  3. Examples

    3.1 Hello World: Word Count


In order to run Spark, we need Scala, which in turn requires Java. So, let's install these requirements first.

1 | Install Requirements

1.1 | Install Java

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Check if installation was successful by running:

$ java -version

The output should be something like:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

1.2 | Install Scala

Download and install deb package from scala-lang.org:

$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb

Note: You may want to check if there's a more recent version. At the time of this writing, 2.11.7 was the most recent stable release. Visit the Scala download page to check for updates.

Again, let's check whether the installation was successful by running:

$ scala -version

which should return something like:

Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL

1.3 | Install git

We shall install Apache Spark by building it from source. This procedure depends implicitly on git, so be sure to install git if you haven't already:

$ sudo apt-get -y install git

1.4 | Install py4j

PySpark requires the py4j Python package. If you're working inside a virtual environment, run:

$ pip install py4j

otherwise, run:

$ sudo pip install py4j
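
To double-check that the interpreter you plan to use with Spark can actually see the package, a quick import test like the following will do (a minimal sketch; it merely prints where py4j was loaded from):

# quick sanity check: importing py4j should not raise an ImportError
import py4j

# print the location the package was loaded from
print(py4j.__file__)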

2 | Set Up Apache Spark

2.1 | Download and extract source tarball

$ cd ~/Downloads
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz
$ tar xvf spark-1.6.0.tgz

Note: Here too, you may want to check whether there's a more recent version; visit the Spark download page.

2.2 | Compile source

$ cd ~/Downloads/spark-1.6.0
$ sbt/sbt assembly

This will take a while... (approximately 20 to 30 minutes)

After the dust settles, you can check whether Spark was installed correctly by running the following example, which should return an approximation of the number π ≈ 3.14159...

$ ./bin/run-example SparkPi 10

This should print a line like:

Pi is roughly 3.14042

Note: You may want to lower the verbosity level of the log4j logger. You can do so by editing the log4j properties file (assuming we're still inside the ~/Downloads/spark-1.6.0 folder):

$ cp conf/log4j.properties.template conf/log4j.properties
$ nano conf/log4j.properties

and replace the line:

log4j.rootCategory=INFO, console

by

log4j.rootCategory=ERROR, console

2.3 | Install files

$ sudo mv ~/Downloads/spark-1.6.0 /opt/
$ sudo ln -s /opt/spark-1.6.0 /opt/spark

Add Spark to your environment by editing your ~/.bashrc file:

$ nano ~/.bashrc

Add the following lines at the bottom of this file:

# needed for Apache Spark
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python

Reload your bashrc to make use of these changes by running:

$ . ~/.bashrc

If your ipython instance somehow doesn't pick up these environment variables for whatever reason, you can also make sure they are set when ipython spins up. To do so, create a new Python script named load_spark_environment_variables.py in the default profile's startup folder:

$ nano ~/.ipython/profile_default/startup/load_spark_environment_variables.py

and paste the following lines in this file:

import os
import sys

# make sure pyspark can locate the Spark installation
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/opt/spark'

# make sure the pyspark package itself is importable
if '/opt/spark/python' not in sys.path:
    sys.path.insert(0, '/opt/spark/python')
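
With these environment variables in place, pyspark should be importable from a fresh Python or ipython session. A quick way to confirm this (a minimal sketch; the exact path printed depends on your setup):

# pyspark lives under $SPARK_HOME/python, which we added to the path above
import pyspark

# should print something like /opt/spark/python/pyspark/__init__.py
print(pyspark.__file__)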

3 | Examples

Now we're finally ready to run our first PySpark application. Create a SparkContext by opening up a Python interpreter (or ipython / ipython notebook) and running:

>>> from pyspark import SparkContext
>>> sc = SparkContext()

The SparkContext instance sc is your gateway to everything sparkly.
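
For instance, you can distribute a small list of numbers and run a computation on it straight away (a minimal sketch to confirm the context works; the numbers are arbitrary):

>>> # distribute a local list across the Spark workers
>>> rdd = sc.parallelize(range(1, 11))
>>> # sum the elements in parallel and return the result to the driver
>>> rdd.sum()
55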

3.1 | Hello World: Word Count

Check out the notebook spark_word_count.ipynb.
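
If you just want the gist of it, the core of a word count in PySpark looks roughly like this (a minimal sketch; /opt/spark/README.md is simply used as a convenient example input file):

>>> # read a text file as an RDD of lines
>>> lines = sc.textFile('/opt/spark/README.md')
>>> # split each line into words, pair each word with a count of 1,
>>> # and add up the counts per word
>>> counts = (lines.flatMap(lambda line: line.split())
...                .map(lambda word: (word, 1))
...                .reduceByKey(lambda a, b: a + b))
>>> # show the ten most frequent words
>>> counts.takeOrdered(10, key=lambda wc: -wc[1])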
