This project demonstrates PySpark functionality on a dataset of NBA player statistics. The dataset is queried, a transformation is applied, and the output is shown in a markdown file.
- Use PySpark to perform data processing on a large dataset
- Include at least one Spark SQL query and one data transformation
📦 fan_xu_pyspark
- .github
  - workflows
    - cicd.yml
- Makefile
- NBA_24_stats.csv
- README.md
- __pycache__
  - script.cpython-312.pyc
- gitignore
- lib.py
- output.md
- requirements.txt
- script.py
- test_lib.py
- EDA: The first 3 rows are displayed, along with summary statistics for the age, assists, and steals columns.
- Query: The top 10 highest-scoring players are queried.
- Transformation: A column is added to show the assist/turnover ratio of the players.
Requirements:
- Python
- PySpark
- Java