This repo contains code for practising PySpark.
Contents
1. rdd_.ipynb -> This notebook covers the basics of RDDs.
2. Pyspark_Intro.ipynb -> This notebook contains code for creating RDDs, building a PySpark DataFrame from an RDD, and converting a PySpark DataFrame to a Pandas DataFrame.
3. Working_with_Hive_and_PySpark_in_Google_Cloud_Dataproc.ipynb -> This notebook explains how to save a PySpark DataFrame to Hive tables and how to run this code on Google Cloud Dataproc.
4. PySpark_Advanced.ipynb -> This notebook goes deeper into DataFrames, working with different types of data, Spark SQL, and some advanced RDD concepts.
5. Algoscale_Assignment.ipynb -> This notebook contains the solution to the take-home assignment round of the interview for the Data Engineer position at AlgoScale.
6. AlgoScale_Interview_Problems.ipynb -> This notebook contains the solutions to the technical round of the interview for the Data Engineer position at AlgoScale.