PySpark Playground

Hi there! Welcome to PySpark Playground! This is a collection of PySpark examples I put together while learning Apache Spark. I wanted to document my journey, share my progress, and hopefully help others get started with Spark too.

🌟 Why I Made This

When I started exploring Apache Spark, I realized how powerful it is for working with big data. But, like anything new, it felt a bit overwhelming at first. So, I decided to create simple, hands-on examples that make it easier to understand the basics. This repository is my way of sharing what I’ve learned so far.

📘 What You'll Find Here

This repository includes a short tutorial in the notebook: pyspark_tutorial_with_examples.ipynb. It’s packed with practical examples using an ice cream sales dataset—because who doesn’t love ice cream?

🛠️ What’s Inside:

  • Getting Started: Learn how to set up Apache Spark and PySpark in Google Colab.
  • DataFrame Basics: Examples of selecting columns, filtering rows, adding calculated columns, and grouping data.
  • Popular Functions: Hands-on with PySpark’s most-used transformations, like “withColumn”, “groupBy”, and aggregations (e.g., average, sum).
  • Unique IDs: See how to generate unique IDs for rows using “monotonically_increasing_id”.
  • SQL Magic: Combine SQL with PySpark to run custom transformations.
  • Real-Life Data: Follow along with an ice cream sales dataset for practical use cases.
  • Extra Tricks: Work with dates, timestamps, and other cool features.

🚀 How to Use This

  1. Clone the repository:
    git clone https://github.com/AlefRP/pyspark-playground.git
  2. Open the notebook pyspark_tutorial_with_examples.ipynb in Jupyter Notebook or Google Colab.
  3. Follow the examples, tweak the code, and see what happens—learning by doing is the best way!

📚 My Go-To Resources

While I tried to keep this tutorial as clear as possible, Spark’s official documentation has been a lifesaver for me. If you want to dive deeper, I highly recommend checking it out.

🤔 Why You Should Check This Out

  • It’s beginner-friendly (I’m a beginner too!).
  • It focuses on hands-on learning with fun, real-world examples.
  • It shows how to use PySpark in environments like Google Colab.

I hope this helps you get started with PySpark and makes your learning journey a bit easier. Let’s explore big data together!

🤝 Contributions Welcome!

If you have ideas, improvements, or your own examples, I’d love to see them! Feel free to fork this repository and contribute.

📄 License

This project is open-source and available under the MIT License.