utkarshgupta137/sparkmonitor

 
 

SparkMonitor

An extension for Jupyter Lab & Jupyter Notebook to monitor Apache Spark (pyspark) from notebooks

About

SparkMonitor is an extension for Jupyter Notebook & Lab that enables live monitoring of Apache Spark jobs spawned from a notebook. The extension provides several features to monitor and debug a Spark job from within the notebook interface.

[Image: job display]

Requirements

  • JupyterLab 3, or Jupyter Notebook 4.4.0 or higher
  • A local pyspark 2 or 3 installation, or sparkmagic to connect to a remote Spark instance

Features

  • Automatically displays a live monitoring tool below cells that run Spark jobs
  • A table of jobs and stages with progress bars
  • A timeline which shows jobs, stages, and tasks
  • A graph showing number of active tasks & executor cores vs time

Quick Start

Setting up the extension

pip install sparkmonitor # install the extension

# set up an ipython profile and add our kernel extension to it
ipython profile create # if it does not exist
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >> $(ipython profile locate default)/ipython_kernel_config.py

# For use with Jupyter Notebook, install and enable the nbextension
jupyter nbextension install sparkmonitor --py
jupyter nbextension enable sparkmonitor --py

# The jupyterlab extension is automatically enabled
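
Because `echo ... >>` appends a new line every time it runs, repeating the setup can leave duplicate entries in the config file. A minimal Python sketch of an idempotent alternative follows. The function name `ensure_kernel_extension` and its `config_path` argument are illustrative and not part of sparkmonitor.

```python
from pathlib import Path

# The line the Quick Start appends to ipython_kernel_config.py.
EXTENSION_LINE = (
    "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')"
)


def ensure_kernel_extension(config_path):
    """Append EXTENSION_LINE to config_path unless it is already present.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    # Compare whole lines so that a commented-out copy does not count.
    if EXTENSION_LINE in (line.strip() for line in existing.splitlines()):
        return False
    # Make sure the new entry starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + EXTENSION_LINE + "\n")
    return True
```

Call it with the directory printed by `ipython profile locate default`, joined with `ipython_kernel_config.py`.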

Connecting to a local spark instance

With the extension installed, a SparkConf object called conf will be usable from your notebooks. You can use it as follows:

from pyspark import SparkContext
# Start the spark context using the SparkConf object named `conf` the extension created in your kernel.
sc = SparkContext.getOrCreate(conf=conf)

If you already have your own Spark configuration, set spark.extraListeners to sparkmonitor.listener.JupyterSparkMonitorListener, and set spark.driver.extraClassPath to the path of the listener JAR inside the installed sparkmonitor Python package: path/to/package/sparkmonitor/listener_<scala_version>.jar

from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener')\
        .config('spark.driver.extraClassPath', 'venv/lib/python3.<X>/site-packages/sparkmonitor/listener_<scala_version>.jar')\
        .getOrCreate()
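
The JAR path depends on the virtual environment and the Python version, so it may be easier to look it up from the installed package than to hard-code it. The sketch below assumes the JAR files sit in the top-level package directory with names matching `listener_*.jar`, as the path above suggests. `find_listener_jars` is a hypothetical helper, not a sparkmonitor API.

```python
import glob
import importlib.util
import os


def find_listener_jars(package="sparkmonitor"):
    """Return the sorted paths of listener_*.jar files in an installed package.

    The package is located without being imported. An empty list is
    returned when the package is not installed or ships no such JAR.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []
    jars = []
    for location in spec.submodule_search_locations:
        jars.extend(glob.glob(os.path.join(location, "listener_*.jar")))
    return sorted(jars)
```

If more than one JAR is returned, pick the one whose Scala version matches your Spark build and pass that path to spark.driver.extraClassPath.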

Connecting to a remote spark instance via sparkmagic

  • Set up sparkmagic and verify that it works

  • Copy the required JAR file to the remote Spark servers

  • Add listener_<scala_version>.jar to the Spark job

    For example, set spark.jars to https://github.com/swan-cern/sparkmonitor/releases/download/<release>/listener_<scala>.jar

  • Set spark.extraListeners as above

  • Set the SPARKMONITOR_KERNEL_HOST environment variable for the Spark job using the sparkmagic conf

    For YARN, you may use spark.yarn.appMasterEnv to set the variables
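
Put together, the settings from the steps above could be collected into a single session configuration. The sketch below builds that configuration as a dictionary and prints it as JSON, on the assumption that it is passed to sparkmagic in the `conf` section of a session configuration (for example through the `%%configure` magic). The helper name `build_session_conf`, the example host value, and the use of the YARN `appMasterEnv` prefix are assumptions for illustration. The `<release>` and `<scala>` placeholders are left unfilled and must be replaced with real values.

```python
import json

# Listener class name given in the section on local Spark instances.
LISTENER_CLASS = "sparkmonitor.listener.JupyterSparkMonitorListener"


def build_session_conf(jar_url, kernel_host):
    """Return a session configuration dict carrying the SparkMonitor settings."""
    return {
        "conf": {
            # Ship the listener JAR with the Spark job.
            "spark.jars": jar_url,
            # Register the SparkMonitor listener.
            "spark.extraListeners": LISTENER_CLASS,
            # On YARN, expose the kernel host to the application master.
            "spark.yarn.appMasterEnv.SPARKMONITOR_KERNEL_HOST": kernel_host,
        }
    }


if __name__ == "__main__":
    session_conf = build_session_conf(
        jar_url=(
            "https://github.com/swan-cern/sparkmonitor/releases/download/"
            "<release>/listener_<scala>.jar"
        ),
        kernel_host="notebook-host.example.com",  # hypothetical host name
    )
    print(json.dumps(session_conf, indent=2))
```

Whether the `appMasterEnv` setting alone is sufficient depends on the cluster manager and deploy mode, so treat the third key as a starting point.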

Development

If you'd like to develop the extension:

# See package.json scripts for building the frontend
yarn run build:<action>

# Install the package in editable mode
pip install -e .

# Symlink jupyterlab extension
jupyter labextension develop --overwrite .

# Watch for frontend changes
yarn run watch

# Build the spark JAR files
sbt +package

History

Changelog

This repository is published to PyPI as sparkmonitor.
