PySpark

This project demonstrates PySpark functionality on a dataset of NBA player statistics. The dataset is queried with Spark SQL, a transformation is applied, and the results are shown in a markdown file.

Requirements

Use PySpark to perform data processing on a large dataset
Include at least one Spark SQL query and one data transformation

Project Structure

📦 fan_xu_pyspark
    .github/
        workflows/
            cicd.yml
    Makefile
    NBA_24_stats.csv
    README.md
    __pycache__/
        script.cpython-312.pyc
    gitignore
    lib.py
    output.md
    requirements.txt
    script.py
    test_lib.py


Highlights

EDA

The first 3 rows are displayed, along with summary statistics for the age, assists, and steals columns.
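The EDA step could look roughly like the sketch below. It is not the project's actual code: the function names `load_stats`, `explore`, and `main` are hypothetical, and the column names `Age`, `AST`, and `STL` are assumptions about the header of NBA_24_stats.csv that may need adjusting.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession

# Assumed column names for age, assists, and steals; check the CSV header.
SUMMARY_COLUMNS = ("Age", "AST", "STL")


def load_stats(spark: SparkSession, path: str = "NBA_24_stats.csv") -> DataFrame:
    """Read the CSV with a header row and let Spark infer the column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def explore(df: DataFrame, columns: Sequence[str] = SUMMARY_COLUMNS) -> None:
    """Print the first 3 rows, then summary statistics for `columns`."""
    df.show(3)
    # describe() reports count, mean, stddev, min, and max per column.
    df.describe(*columns).show()


def main() -> None:
    # Imported here so the helpers above can be imported without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nba_eda").getOrCreate()
    try:
        explore(load_stats(spark))
    finally:
        spark.stop()
```

Calling `main()` from the repository root would print both tables to the console, assuming the CSV sits in the working directory.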

Query

The 10 highest-scoring players are retrieved with a Spark SQL query.
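A query of this kind might be sketched as follows. The view name `nba_stats`, the column names `Player` and `PTS`, and the helper names are all assumptions for illustration. Whether "highest-scoring" refers to total points or points per game depends on the dataset.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame, SparkSession


def top_scorers_sql(
    view: str = "nba_stats",
    player_col: str = "Player",  # assumed column name
    points_col: str = "PTS",     # assumed column name
    limit: int = 10,
) -> str:
    """Build a Spark SQL statement that ranks players by points, descending."""
    return (
        f"SELECT {player_col}, {points_col} "
        f"FROM {view} "
        f"ORDER BY {points_col} DESC "
        f"LIMIT {int(limit)}"
    )


def top_scorers(spark: SparkSession, df: DataFrame, limit: int = 10) -> DataFrame:
    """Register `df` as a temporary view and run the ranking query against it."""
    df.createOrReplaceTempView("nba_stats")
    return spark.sql(top_scorers_sql("nba_stats", limit=limit))
```

Given a session and a loaded DataFrame, `top_scorers(spark, df).show()` would print the ranked rows.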

Transformation

A column is added showing each player's assist-to-turnover ratio.
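One way to add such a column is sketched below, under the assumption that assists and turnovers are stored in columns named `AST` and `TOV`. The helper names and the new column name `AST_TOV_ratio` are hypothetical. The sketch guards against zero turnovers explicitly, because Spark either returns null or raises an error on division by zero depending on its ANSI setting.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame


def ratio_expr(assists_col: str = "AST", turnovers_col: str = "TOV") -> str:
    """SQL expression for assists / turnovers, null when turnovers are zero."""
    return (
        f"CASE WHEN {turnovers_col} = 0 THEN NULL "
        f"ELSE {assists_col} / {turnovers_col} END"
    )


def add_ast_tov_ratio(df: DataFrame, new_col: str = "AST_TOV_ratio") -> DataFrame:
    """Return `df` with every original column plus the ratio column."""
    return df.selectExpr("*", f"{ratio_expr()} AS {new_col}")
```

The same result could be written with `withColumn` and `pyspark.sql.functions.when`; `selectExpr` is used here only to keep the sketch short.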

Installation

Requirements:

Python
PySpark
Java
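Before running the scripts, it can help to confirm that the three requirements are visible from the current environment. The following is a small, hypothetical check (the function name `check_prerequisites` is not part of the project); it treats Java as available if a `java` executable is on the PATH or `JAVA_HOME` is set.

```python
from __future__ import annotations

import importlib.util
import os
import platform
import shutil


def check_prerequisites() -> dict[str, object]:
    """Report the Python version and whether PySpark and Java can be found."""
    return {
        "python": platform.python_version(),
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "java": shutil.which("java") is not None or bool(os.environ.get("JAVA_HOME")),
    }


def main() -> None:
    for name, status in check_prerequisites().items():
        print(f"{name}: {status}")
```

Calling `main()` prints one line per requirement; a `False` next to `pyspark` or `java` means that dependency still needs to be installed.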


nogibjj/fan_xu_pyspark
