# pyspark

Here are 1,149 public repositories matching this topic...

dashmug / glue-utils

Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs. It reduces boilerplate code, increases type safety, and improves IDE auto-completion, making Glue development easier and more efficient.

python aws spark etl pyspark data-engineering elt aws-glue

Updated Jul 15, 2024
Python

canimus / cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

unit-testing bigdata pandas python3 performance-metrics pyspark data-quality-checks data-quality dataquality snowpark pydeequ

Updated Jul 15, 2024
Python

ibis-project / ibis

the portable Python dataframe library

Updated Jul 15, 2024
Python

sb-ai-lab / RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

machine-learning deep-learning algorithms evaluation distributed-computing transformers pytorch collaborative-filtering matrix-factorization pyspark recsys recommender-system recommendation-algorithms

Updated Jul 15, 2024
Python

MrPowers / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Updated Jul 15, 2024
Python

iobruno / data-engineering-zoomcamp

Data Engineering examples covering Airflow and Mage for workflows; dbt for BigQuery, Redshift, ClickHouse; Spark and Kafka for Batch/Streaming Processing

kafka spark pyspark kafka-streams spark-sql workflow-orchestration ksqldb dbt-bigquery dbt-postgres dbt-clickhouse dbt-redshift

Updated Jul 15, 2024
Python

ev2900 / Glue_Aggregate_Small_Files

PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue

aws s3 glue pyspark small-files

Updated Jul 13, 2024
Python

ev2900 / Glue_Examples

PySpark code samples designed for AWS Glue

aws glue pyspark aws-glue

Updated Jul 13, 2024
Python

lykmapipo / Python-Spark-Log-Analysis

Python scripts to process and analyze log files using PySpark.

Updated Jul 13, 2024
Python

mitchelllisle / sparkdantic

✨ A Pydantic to PySpark schema library

schema pyspark pydantic

Updated Jul 15, 2024
Python

longNguyen010203 / Spark-Processing-AWS

👷🌇 Set up and build a big data processing pipeline with Apache Spark and 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), using Terraform to set up the infrastructure and an Airflow integration to automate workflows 🥊

aws apache-spark terraform aws-s3 iam pyspark cloud-computing aws-ec2 redshift data-pipeline aws-services apache-airflow emr-cluster spark-cluster spark-master spark-worker

Updated Jul 12, 2024
Python

databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can generate large simulated / synthetic data sets for testing, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.

python spark faker pyspark spark-streaming data-generation databricks synthetic-data datagen datagenerator deltalake datageneration delta-live-tables

Updated Jul 15, 2024
Python

KiarashYavari / avg_close_price

Scenario: You have CSV files containing daily stock prices with columns: Date, Ticker, Open, High, Low, Close, and Volume. You want to compute the daily average closing price for each stock.

sql spark python3 pyspark batch-processing

Updated Jul 12, 2024
Python
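
The avg_close_price scenario above boils down to a group-and-average over the `Close` column. The following is a minimal sketch in plain Python rather than the repository's own PySpark code, which is not shown on this page. The sample rows, the ticker names, and the `average_close_by_ticker` helper are all hypothetical, and the sketch assumes the average is taken per ticker across all days in the input.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data using the columns named in the scenario.
SAMPLE_CSV = """Date,Ticker,Open,High,Low,Close,Volume
2024-07-01,AAA,10.0,11.0,9.5,10.5,1000
2024-07-02,AAA,10.5,12.0,10.0,11.5,1200
2024-07-01,BBB,20.0,21.0,19.0,20.0,500
2024-07-02,BBB,20.0,23.0,20.0,22.0,700
"""


def average_close_by_ticker(rows):
    """Return {ticker: mean Close} for an iterable of dict-like rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        # CSV values arrive as strings, so convert Close before summing.
        totals[row["Ticker"]] += float(row["Close"])
        counts[row["Ticker"]] += 1
    return {ticker: totals[ticker] / counts[ticker] for ticker in totals}


# Read the in-memory sample as if it were one of the CSV files.
averages = average_close_by_ticker(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(averages)
```

In PySpark the same aggregation would typically be written with `DataFrame.groupBy("Ticker")` followed by `agg(avg("Close"))`; whether the repository groups by ticker alone or by ticker and date is an assumption here.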

capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!

python data-science data spark numpy pandas pyspark compare dask dataframes fugue polars

Updated Jul 11, 2024
Python

NHSDigital / datascience-seminars

A repo to hold resources and code from data science seminars. For more info contact datascience@nhs.net.

python data-science health python3 pyspark healthcare nhs rap reproducible-analytical-pipeline nhs-digital

Updated Jul 10, 2024
Python

jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

magic spark kernel jupyter notebook cluster pandas-dataframe jupyter-notebook sql-query pyspark kerberos livy

Updated Jul 9, 2024
Python

Nike-Inc / koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

python pyspark data-engineering pydantic delta-lake

Updated Jul 9, 2024
Python

astrolabsoftware / fink-filters

Define your filters to create your alert stream in Fink!

python extension astronomy pyspark broker

Updated Jul 15, 2024
Python

baranylcn / churn_w_pyspark

python big-data pyspark churn

Updated Jul 8, 2024
Python

rohitgoyal1999 / Stock_Market_Data_Pipeline

This project is designed to fetch historical stock data for a predefined list of companies and store it in a MySQL database. It utilizes Python, Spark, and the Alpha Vantage API to perform data extraction, transformation, and loading (ETL) tasks.

github api sql pyspark stocks dataengineering

Updated Jul 8, 2024
Python
