Hi there! Welcome to PySpark Playground! This is a collection of PySpark examples I put together while learning Apache Spark. I wanted to document my journey, share my progress, and hopefully help others get started with Spark too.
When I started exploring Apache Spark, I realized how powerful it is for working with big data. But, like anything new, it felt a bit overwhelming at first. So, I decided to create simple, hands-on examples that make it easier to understand the basics. This repository is my way of sharing what I’ve learned so far.
This repository includes a short tutorial in the notebook pyspark_tutorial_with_examples.ipynb. It’s packed with practical examples using an ice cream sales dataset, because who doesn’t love ice cream?
- Getting Started: Learn how to set up Apache Spark and PySpark in Google Colab.
- DataFrame Basics: Examples of selecting columns, filtering rows, adding calculated columns, and grouping data.
- Popular Functions: Hands-on with PySpark’s most-used transformations, like “withColumn”, “groupBy”, and aggregations (e.g., average, sum).
- Unique IDs: See how to generate unique IDs for rows using “monotonically_increasing_id”.
- SQL Magic: Combine SQL with PySpark to run custom transformations.
- Real-Life Data: Follow along with an ice cream sales dataset for practical use cases.
- Extra Tricks: Work with dates, timestamps, and other cool features.
- Clone the repository:
  git clone https://github.com/AlefRP/pyspark-playground.git
- Open the notebook pyspark_tutorial_with_examples.ipynb in Jupyter Notebook or Google Colab.
- Follow the examples, tweak the code, and see what happens. Learning by doing is the best way!
While I tried to keep this tutorial as clear as possible, Spark’s official documentation has been a lifesaver for me. If you want to dive deeper, I highly recommend checking it out.
- This tutorial is beginner-friendly (I’m a beginner too!).
- It focuses on hands-on learning with fun, real-world examples.
- It shows how to use PySpark in environments like Google Colab.
I hope this helps you get started with PySpark and makes your learning journey a bit easier. Let’s explore big data together!
If you have ideas, improvements, or your own examples, I’d love to see them! Feel free to fork this repository and contribute.
This project is open-source and available under the MIT License.