Welcome to the PySpark Tutorial for Beginners GitHub repository! This repository contains a collection of Jupyter notebooks used in my comprehensive YouTube video: PySpark tutorial for beginners. These notebooks provide hands-on examples and code snippets to help you understand and practice PySpark concepts covered in the tutorial video.
If you find this tutorial helpful, consider sharing the video with your friends and colleagues: it helps them discover the power of PySpark and brings us closer to unlocking the following bonus videos.
🎁 Bonus Videos:
- Hit 50,000 views to unlock a video about building an end-to-end machine-learning pipeline with PySpark.
- Hit 100,000 views to unlock another video about end-to-end Spark Streaming.
Do you like this tutorial? Why not check out my other video, Airflow Tutorial for Beginners, which has more than 350k views 👀 and around 7k likes 👍.
Don't forget to subscribe to my YouTube channel and my blog for more exciting tutorials like this. You can also connect with me on X/Twitter and LinkedIn, where I post content regularly. Thank you for your support! ❤️
In our PySpark tutorial video, we covered various topics, including Spark installation, SparkContext, SparkSession, RDD transformations and actions, Spark DataFrames, Spark SQL, and more. These Jupyter notebooks are designed to complement the video content, allowing you to follow along, experiment, and practice your PySpark skills.
To get started with the Jupyter notebooks, follow these steps:
- Clone this GitHub repository to your local machine using the following command: `git clone https://github.com/coder2j/pyspark-tutorial.git`
- Ensure you have Python and Jupyter Notebook installed on your machine.
- Follow the YouTube video part 2: Spark Installation to make sure Spark has been installed on your machine.
- Launch Jupyter Notebook by running: `jupyter notebook`
- Open the notebook you want to work on and start experimenting with PySpark.
- Notebook 1 - 01-PySpark-Get-Started: Instructions and commands for setting the PySpark environment variables to use Spark in Jupyter Notebook.
- Notebook 2 - 02-Create-SparkContext: Creating SparkContext objects in different PySpark versions.
- Notebook 3 - 03-Create-SparkSession.ipynb: Creating SparkSession objects in PySpark.
- Notebook 4 - 04-RDD-Operations.ipynb: Creating RDDs and demonstrating RDD transformations and actions.
- Notebook 5 - 05-DataFrame-Intro.ipynb: Introduction to Spark DataFrames and how they differ from RDDs.
- Notebook 6 - 06-DataFrame-from-various-data-source.ipynb: Creating Spark DataFrames from various data sources.
- Notebook 7 - 07-DataFrame-Operations.ipynb: Performing Spark DataFrame operations such as filtering and aggregation.
- Notebook 8 - 08-Spark-SQL.ipynb: Converting a Spark DataFrame to a temporary table or view and querying it with Spark SQL.
Feel free to explore and run these notebooks at your own pace.
To make the most of these notebooks, you should have the following prerequisites:
- Basic knowledge of Python programming.
- Understanding of data processing concepts (though no prior PySpark experience is required).
These notebooks are meant for self-learning and practice. Follow along with the tutorial video to gain a deeper understanding of PySpark concepts. Experiment with the code, modify it, and try additional exercises to solidify your skills.
If you'd like to contribute to this repository by adding more notebooks, improving documentation, or fixing issues, please feel free to fork the repository, make your changes, and submit a pull request. We welcome contributions from the community!
This project is licensed under the MIT License.