This project allows you to get PySpark with Jupyter Notebook up and running with a single command!
Check out the usage section for installation instructions.
This project uses a container-based solution built on Docker. It sets up an isolated 'container' that has all required software pre-installed. This is comparable to a virtual machine, but it's much more flexible and has near-native performance. All of this is done automatically, with nothing to worry about.
This project provides a custom Docker image with a set of scripts to use and set up a container running Apache Spark with PySpark, Jupyter Notebook and some other common tools.
This container is intended for the period 7 classes of HBO-ICT at the HHS, and has various resources and assignments used in class preinstalled!
- Single command to get everything up and running!
- Jupyter Notebook 4.3.0
- PySpark 2.0.1
- Python 3.5.1
- Spark 2.0.1
- Hadoop 2.7
- College resources preinstalled.
- Operating system:
  - Linux
  - Mac OS X
  - Windows 10 Pro, Enterprise or Edu (not Home or Mobile)
    - For other Windows versions, follow the special requirements and instructions
- Docker (not Docker Toolbox)
- Docker Compose (should come with Docker)
- Git
- Important: VT-x / virtualization must be enabled in the BIOS. More information.
- ~1.3GB of free space
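On Linux or Mac OS X, the tool and disk-space requirements above can be checked from a terminal before installing. The following is a minimal sketch and not part of the project's own scripts: the function names `missing_tools` and `has_free_kib` are hypothetical, and the executable name `docker-compose` and the 1400000 KiB threshold (a rough stand-in for ~1.3GB) are assumptions.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not shipped with this project.

# Print the name of every tool given as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed when the filesystem holding directory $1 has at least $2 KiB free.
has_free_kib() {
    # df -P prints one line per filesystem; column 4 is the available space.
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}

# Example usage (assumed tool names and threshold):
#   missing_tools docker docker-compose git
#   has_free_kib . 1400000 || echo "less than ~1.3GB free" >&2
```

If `missing_tools` prints nothing, all listed tools were found. These checks do not cover the virtualization requirement, which still has to be verified in the BIOS.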
Follow these steps to get the container up and running:
- Make sure you meet all requirements above and install any missing software.
- Clone the project repository (`git clone https://github.com/timvisee/hhs-p7-spark-docker.git`)
- Change into the project repository (`cd hhs-p7-spark-docker`)
- Important: If on Windows, enable sharing of the drive the project is installed on. More information.
- Install and start the container:
  - Linux/OSX: `./start` or `./update-and-start`
  - Windows: `start.bat` or `update-and-start.bat`

It's recommended to use the `./update-and-start` / `update-and-start.bat` script in the future to start the container. This will automatically fetch new updates when available.
The installation runs automatically the first time you start the container. Downloading the container image might take a long while, as it's around 1.3GB in size.
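For Linux and Mac OS X, the installation steps can be collected into one small script. This is a sketch under assumptions rather than something the project ships: the `run` and `first_time_setup` functions and the `DRY_RUN` variable are hypothetical names, while the repository URL, directory name and `./update-and-start` script are the ones mentioned in this README.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented installation steps.

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Clone the repository if it is not there yet, enter it, then update and start.
first_time_setup() {
    repo_url="https://github.com/timvisee/hhs-p7-spark-docker.git"
    repo_dir="hhs-p7-spark-docker"

    # Skip the clone when the directory already exists.
    [ -d "$repo_dir" ] || run git clone "$repo_url" || return 1
    run cd "$repo_dir" || return 1
    # On the first start this also triggers the installation.
    run ./update-and-start
}

# Example usage:
#   DRY_RUN=1; first_time_setup   # print the commands without running them
#   first_time_setup              # actually clone and start
```

Setting `DRY_RUN=1` first makes it possible to review the commands before anything is downloaded.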
The following scripts/commands are included in the project:
- `./start` / `start.bat`: Start the container (invokes install if required)
- `./update-and-start` / `update-and-start.bat`: Update and start the container (invokes install if required)
- `./stop` / `stop.bat`: Stop running containers
- `./install` / `install.bat`: Install the container
- `./update` / `update.bat`: Update the container and all scripts
- `./uninstall` / `uninstall.bat`: Uninstall the container and its resources

To reopen Jupyter Notebook in the browser for a running container, simply execute the `./start` / `start.bat` script again.
For each commit, a build and test run is automatically started using a CI service. Successful builds are automatically deployed, and will be used by everybody using this project with these management scripts.
Service | Branch | Build Status
---|---|---
Travis CI | master | View Status
Travis CI | last commit | View Status
First of all, it's not required to build anything yourself. Simply use the start script to get things up and running. Building the image yourself can be useful, however, if you've manually changed the Docker image configuration or want to debug things.
If you're using a Linux or Mac OS X operating system, it's possible to build the container image yourself. Simply use the `image/build` script to start a build!
If you have permission to push updates to the main hosted image that everybody uses, you can also use the `image/push` or `image/build-and-push` scripts to build and push an update.
The notebook directory is accessible at `./notebook` in the project repository when the container is running.
On Windows, you must enable drive sharing in the Docker settings for the drive the repository and container are installed on. The notebook directory should be in the same location, but it might differ depending on the configuration.
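Because the notebooks live in a regular directory on the host, they can be archived with standard tools, for example before uninstalling. The following is a minimal sketch for Linux and Mac OS X, assuming the notebook directory is at `notebook` inside the repository; the `backup_notebooks` function is a hypothetical helper and not part of the project's scripts.

```shell
#!/bin/sh
# Hypothetical helper; not shipped with this project.

# Archive the notebook directory of the repository at $1 into the file $2.
backup_notebooks() {
    repo_dir=$1
    archive=$2

    [ -d "$repo_dir/notebook" ] || {
        echo "no notebook directory in $repo_dir" >&2
        return 1
    }
    # -C makes the paths inside the archive relative to the repository root.
    tar -czf "$archive" -C "$repo_dir" notebook
}

# Example usage, run from inside the project repository:
#   backup_notebooks . "$HOME/notebook-backup.tar.gz"
```

The archive can be inspected afterwards with `tar -tzf`, which lists its contents without extracting them.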
This is probably because you haven't enabled VT-x / virtualization in your BIOS. Enable this and try it again. See the requirements for more information.
Jupyter Notebook should automatically open in your browser when you start the container. If the container is already running, simply run `./start` / `start.bat` again to reopen Jupyter Notebook.
Available updates are installed automatically when you run the `./update` / `update.bat` script. It's recommended to use the `./update-and-start` / `update-and-start.bat` scripts to start the container in the future, as these install any available updates before the container starts.
Yes, virtualization must be enabled to use this container. The Docker installer should notify you about this. Search on Google for how to enable virtualization on your specific system.
Yes, these files are available in a subdirectory. See the data directory section.
Yes. All data will remain intact.
The container can be stopped using the `./stop` / `stop.bat` command.
Yes. Even though the container is based on Alpine Linux, Docker ensures that it will run on Windows too.
For Windows 7, 8 or 10 Home/Mobile you must follow the special requirements and installation instructions here.
For Windows 7, 8 or 10 Home/Mobile, you must use Docker Toolbox instead of the normal Docker version to use this project. Please follow the special requirements and installation instructions here.
This project is released under the GNU GPL-3.0 license. Check out the LICENSE file for more information.