Thank you for stopping by my Spark project. From the research I have done so far, Apache Spark is a suitable computing engine and library suite for parallel data processing on computer clusters. In this repo, I coded some Spark basics using Python. The repo contains code for the Spark DataFrame, working with operators in Spark, and working with missing values. It is not an exhaustive list; this is how I got started with the tool.
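As a rough illustration of the kind of operations the notebooks cover, here is a minimal PySpark sketch (assuming `pip install pyspark`); the column names and values are made up for this example and are not taken from the repo:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GettingStarted").getOrCreate()

# Build a small DataFrame with one missing value (illustrative data).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", None), ("Cara", 29)],
    ["name", "age"],
)

# Operators: filter rows using a column expression.
df.filter(df.age > 30).show()

# Missing values: drop rows containing nulls, or fill them with a default.
df.na.drop().show()
df.na.fill({"age": 0}).show()

spark.stop()
```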
To work with Spark on a local machine, you must install some packages and set environment variables that enable Spark to run locally. To get the notebooks in this repo working, download the items below and create the corresponding variables; a sketch of setting them from Python follows the list.
Requirements for Spark setup on a Windows machine:
- JDK
- Python
- Hadoop winutils
- Spark Binaries
- Environment variables
- Python IDE (VS Code or Jupyter Notebook)
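As a sketch only, the environment variables can also be set from inside a notebook before the first `SparkSession` is created. The paths below are placeholders for wherever you unpacked the JDK, Spark binaries, and winutils, and `findspark` (`pip install findspark`) is one common way to make `import pyspark` work in a plain Jupyter kernel:

```python
import os

# Placeholder paths; point these at your own install locations.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"       # JDK install dir
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.5.0-bin-hadoop3"  # unpacked Spark binaries
os.environ["HADOOP_HOME"] = r"C:\hadoop"                        # folder containing bin\winutils.exe

# Make the Spark and Hadoop executables visible on PATH.
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["SPARK_HOME"], "bin")
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")

# findspark adds SPARK_HOME's Python libraries to sys.path.
import findspark
findspark.init()
```

Alternatively, the same variables can be set permanently through Windows System Properties so they apply to every session.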
This repo was created for learning purposes. If you are interested in contributing, or if you have ideas on how to make things better, please let me know.