This repository contains exercises and solutions for a one-day crash course on PySpark and Spark ML. The repository contains only Jupyter notebooks, which assume a working PySpark kernel with Python 3.5 and Spark 2.1.
All notebooks were created by Kaya Kupferschmidt @ dimajix. If you have any questions, feel free to contact me at k.kupferschmidt@dimajix.de.
This notebook contains some simple snippets that give a basic understanding of how to interact with Spark DataFrames in Python.
These notebooks contain the classic word count, implemented with DataFrames.
These notebooks contain a simple linear regression exercise as an introduction to machine learning with Spark.
Building on the simple linear regression, these notebooks contain an exercise in simple statistical text classification.
Like many complex algorithms and ML pipelines, the text classification has many hyperparameters. These notebooks show how to perform hyperparameter tuning with PySpark.