SPARK

Materials for the course: Introduction to Big Data with Apache Spark

Structure

  • core - Apache Spark core examples
  • data - data for the exercises
  • docker - Docker configuration used in the training
  • exercises - exercise questions
  • notebooks - Jupyter notebooks
  • sql - Apache Spark SQL examples
  • streaming - Apache Spark Streaming examples

Setup

Required software

The following software packages are needed for this course:

  • Git
  • Python 3.4+, installed via Anaconda (contains the majority of necessary packages)
  • PySpark (1.6.0+)
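
As a quick sanity check of the requirements above, a minimal sketch along the following lines could be run before the course. The function name check_environment is hypothetical and not part of this repository. It only checks that Python is recent enough, that a git executable is on the PATH, and that a pyspark package can be imported; it does not verify the PySpark version or that Python was installed via Anaconda.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict mapping each requirement to True (found) or False (missing).

    Hypothetical helper for illustration; not part of the course materials.
    """
    return {
        # The course asks for Python 3.4 or newer.
        "python>=3.4": sys.version_info >= (3, 4),
        # Git only needs to be discoverable on the PATH.
        "git": shutil.which("git") is not None,
        # Checks that pyspark is importable; the version is NOT checked here.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


for name, ok in check_environment().items():
    print("{:<12} {}".format(name, "OK" if ok else "MISSING"))
```

A MISSING entry for pyspark is expected when only the Docker setup is used, since PySpark then lives inside the container rather than on the host.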

Docker setup

The Docker setup requires moderate resources but ensures that everyone has a working environment for the training.

Setup steps:

  • Download and install Git https://git-scm.com/downloads
  • Download and install Docker following the instructions:
  • (OS X / Win) Open Docker Quickstart Terminal (use Terminal, not iTerm)
  • Go into this repository
  • Build the Docker images with docker-compose build
  • Start the containers with docker-compose up
    • If one of the above docker commands fails, run eval "$(docker-machine env default)" and then repeat the command, e.g. docker-compose build
    • Jupyter runs on port 8888: on localhost on Linux, and on the Docker VM IP (available from docker-machine ip) on Mac OS X and Windows
    • The data and notebooks directories are mounted directly from the host file system
    • Note that the container stops when the current terminal session is closed

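The rule for locating Jupyter can be sketched in Python as follows. This is an illustration, not part of the course materials: the names docker_host and jupyter_url are hypothetical, the http scheme is an assumption, and the non-Linux branch assumes a Docker Toolbox install where docker-machine ip is available on the PATH.

```python
import platform
import subprocess

JUPYTER_PORT = 8888  # port stated in the setup steps


def docker_host(system=None):
    """Return the host on which the Jupyter port is reachable.

    Hypothetical helper: on Linux this is localhost; elsewhere it is the
    Docker VM IP reported by `docker-machine ip`.
    """
    system = system or platform.system()
    if system == "Linux":
        return "localhost"
    # Mac OS X / Windows: ask docker-machine for the VM's IP address.
    output = subprocess.check_output(["docker-machine", "ip"])
    return output.decode().strip()


def jupyter_url(host):
    """Build the notebook URL for a given host (http scheme is assumed)."""
    return "http://{}:{}".format(host, JUPYTER_PORT)
```

Calling jupyter_url(docker_host()) after docker-compose up should give the address to open in a browser; on Mac OS X and Windows it fails with an error if docker-machine is not installed.
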
Potential issues:

  • Setup can take some time as Docker pulls a number of images from the network
  • Docker Toolbox with VirtualBox does not work well with Microsoft Hyper-V, which newer Docker versions use; remove Hyper-V before installing Docker Toolbox
  • Docker sometimes has problems obtaining IP addresses on restrictive networks
  • Put this repository into your home directory, as Docker can have issues mounting folders that are placed outside of the home directory

Manual setup

This setup requires the fewest resources but can be difficult on Windows machines.

Setup steps:

Building Java

Most of the examples are written in Java 8, apart from the Word Count examples, which are written in Java 7, Java 8, and Scala; see the file suffixes.

The project is built with Apache Maven (http://maven.apache.org).

mvn clean
mvn install