Read more on Medium:
- Part 1: Meet “spark-sight”: Spark Performance at a Glance
- Part 2: “spark-sight” Shows Spill: Skewed Data and Executor Memory
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance:
- CPU time spent doing the “actual work”
- CPU time spent doing shuffle reading and writing
- CPU time spent doing serialization and deserialization
- Spill intensity per executor (v0.1.8 or later)
- (coming) Memory usage per executor
spark-sight is not meant to replace the Spark UI altogether; rather, it provides a bird’s-eye view of the stages, allowing you to identify at a glance which portions of the execution may need improvement.
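The quantities above are derived from the per-task metrics that Spark writes to its event log. Purely as an illustration of where they live (this is not spark-sight's actual parsing code, and field names are those written by recent Spark versions), a minimal sketch of reading a few of them:

import json

# Minimal sketch: sum a few task metrics from a Spark event log.
# Field names follow the JSON written by recent Spark versions;
# this only illustrates the data source, not how spark-sight parses it.
totals = {"cpu_time_ns": 0, "deserialize_time_ms": 0,
          "serialize_time_ms": 0, "bytes_spilled": 0}

with open("/path/to/spark-application-12345") as log:
    for line in log:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics", {})
        totals["cpu_time_ns"] += metrics.get("Executor CPU Time", 0)
        totals["deserialize_time_ms"] += metrics.get("Executor Deserialize Time", 0)
        totals["serialize_time_ms"] += metrics.get("Result Serialization Time", 0)
        totals["bytes_spilled"] += (metrics.get("Memory Bytes Spilled", 0)
                                    + metrics.get("Disk Bytes Spilled", 0))

print(totals)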
The resulting Plotly figure consists of charts sharing a synced x-axis:
- The top chart shows efficiency in terms of CPU cores available for tasks
- The middle chart shows spill information
- The bottom chart shows the stage timeline
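spark-sight builds this figure for you. Just to illustrate the layout (the traces below are dummy data, not spark-sight's output), a three-chart Plotly figure with a shared x-axis can be sketched like this:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Sketch of the three-chart layout with a shared (synced) x-axis.
# Dummy data, only to show the structure of the figure.
fig = make_subplots(
    rows=3, cols=1, shared_xaxes=True,
    subplot_titles=("Efficiency", "Spill", "Stage timeline"),
)
fig.add_trace(go.Scatter(x=[0, 1, 2], y=[0.9, 0.4, 0.7], name="efficiency"), row=1, col=1)
fig.add_trace(go.Bar(x=[0, 1, 2], y=[0, 512, 0], name="spill (MB)"), row=2, col=1)
fig.add_trace(go.Bar(x=[1, 2], y=[1, 1], name="stages"), row=3, col=1)
fig.show()  # opens the figure in a browser tab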
Install spark-sight from PyPI:
pip install spark-sight
spark-sight builds on:
- Pandas: Powerful Python data analysis toolkit, used for Spark event log ingestion and processing
- Plotly: The interactive graphing library for Python, powering the awesome interactive UI
Run the CLI with --help to see the available options:
spark-sight --help
                      _             _       _     _
 ___ _ __   __ _ _ __| | __     ___(_) __ _| |__ | |_
/ __| '_ \ / _` | '__| |/ /____/ __| |/ _` | '_ \| __|
\__ \ |_) | (_| | |  |   <_____\__ \ | (_| | | | | |_
|___/ .__/ \__,_|_|  |_|\_\    |___/_|\__, |_| |_|\__|
    |_|                               |___/
usage: spark-sight [-h] [--path path] [--cpus cpus] [--deploy_mode [deploy_mode]]
Spark performance at a glance.
optional arguments:
  -h, --help            show this help message and exit
  --path path           Local path to the Spark event log
  --cpus cpus           Total CPU cores of the cluster
  --deploy_mode [deploy_mode]
                        Deploy mode the Spark application was submitted with. Defaults to cluster deploy mode
On Linux and macOS:
spark-sight \
--path "/path/to/spark-application-12345" \
--cpus 32 \
--deploy_mode "cluster_mode"
A new browser tab will be opened.
On Windows (PowerShell):
spark-sight `
--path "C:\path\to\spark-application-12345" `
--cpus 32 `
--deploy_mode "cluster_mode"
A new browser tab will be opened.
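The --path argument points at the Spark event log of the application you want to inspect. If your applications do not write one yet, event logging can be turned on through standard Spark configuration (spark.eventLog.enabled and spark.eventLog.dir); for example, in PySpark, where the app name and log directory below are only placeholders:

from pyspark.sql import SparkSession

# Standard Spark settings that make the application write an event log;
# pass the resulting file (named after the application ID) to spark-sight --path.
spark = (
    SparkSession.builder
    .appName("my-app")                                        # placeholder app name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")  # placeholder directory
    .getOrCreate()
)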