# NBA stats

## 1. 💬 Project description

This project aims to build a local database for retrieving NBA data through SQL queries. It consists of two main parts:

- Scraping: to extract the raw data
- Data engineering: to transform the data with dbt and DuckDB

Data warehouse documentation: link

## 2. 📟 Prerequisites

The project uses uv (v0.5.10) to manage the Python version and the dependencies.

## 3. 🔌 Quickstart

To set up the project locally, run the following steps (a quick sanity check is sketched after the list):

1. `curl -LsSf https://astral.sh/uv/0.5.10/install.sh | sh` (install uv v0.5.10; see doc)
2. `uv sync` (create the virtual environment and install the dependencies)
3. `uv run pre-commit install -t commit-msg -t pre-commit` (set up the pre-commit hooks)
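
To confirm the environment is ready, the following commands should all succeed (a minimal check; the exact versions reported depend on the project's pins):

```sh
uv --version          # should report 0.5.10
uv run python -V      # Python from the project-managed virtual environment
uv run dbt --version  # dbt installed inside the same environment
```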

## 4. 🚀 Run

### 4.1. ⚙️ Scraping scripts

It is not necessary to run these scripts again, as the data has already been extracted:

- `cd ./scraping`
- Generate `game_schedule.csv`: `uv run python get_games_schedule.py`
- Generate `game_boxscore.csv`: `uv run python get_games_boxscore.py`

The generated data is then copied to the sources of the dbt project: `cp ./scraping/data/*.parquet ./transform/nba_dwh/local_source/`
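
End to end, the scraping pass therefore looks like this (a sketch; it assumes the scripts write their output under `./scraping/data/`, as the copy command above implies):

```sh
cd ./scraping
uv run python get_games_schedule.py   # produces the game schedule data
uv run python get_games_boxscore.py   # produces the game boxscore data
cd ..
cp ./scraping/data/*.parquet ./transform/nba_dwh/local_source/
```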

### 4.2. ⚙️ Create database

The following steps create the local DuckDB database with dbt (a combined shortcut is sketched after the list):

1. `cd ./transform/nba_dwh`
2. `uv run dbt deps` (install the dbt dependencies)
3. `uv run dbt run` (run the transformations)
4. `uv run dbt test` (test the pipeline)
5. `uv run dbt docs generate` (generate the documentation)
6. `uv run dbt docs serve` (serve the documentation locally)
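
As a shortcut, `dbt build` runs the models and their tests in a single pass (assuming the project's dbt version supports it; `deps` still has to run first):

```sh
cd ./transform/nba_dwh
uv run dbt deps
uv run dbt build   # roughly `dbt run` followed by `dbt test`
```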

### 4.3. ⚙️ Interact with database

Once the database is created:

- Open the local db: `uv run duckcli ./nba_dwh.duckdb`
- Query the data, for example:
```sql
-- Career statistics of Rajon Rondo
select p.player_name, s.years, ps.nb_games, ps.avg_points, ps.avg_assists
from player_season ps
inner join player p on p.id = ps.player_id
inner join season s on s.id = ps.season_id
where p.player_name like 'Rajon Rondo'
order by s.years
```
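
The same models support aggregate queries; for instance, a hypothetical single-season leaderboard (the `'2015-16'` label format for `s.years` is an assumption, not confirmed by the schema):

```sql
-- Hypothetical example: top 10 scorers of one season
-- (assumes s.years holds labels such as '2015-16')
select p.player_name, ps.avg_points
from player_season ps
inner join player p on p.id = ps.player_id
inner join season s on s.id = ps.season_id
where s.years = '2015-16'
order by ps.avg_points desc
limit 10
```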

## 5. 🔗 Internal Architecture

- Folder `/scraping`: contains the scripts that generate the raw data
- Folder `/transform`: contains the dbt project that builds the database
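
Put together, the paths referenced in this README form roughly the following layout (a sketch; the database file location is inferred from the `duckcli` command above):

```
.
├── scraping/
│   ├── get_games_schedule.py
│   ├── get_games_boxscore.py
│   └── data/                  # generated raw files
└── transform/
    └── nba_dwh/               # dbt project
        ├── local_source/      # sources copied from scraping/data
        └── nba_dwh.duckdb     # built database
```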

## 6. 🏆 Code Quality and Formatting

- The Python files are linted and formatted with ruff; see the configuration in `pyproject.toml`
- The dbt SQL model files are formatted with sqlfmt
- A pre-commit configuration is available to trigger the quality checks (e.g. the linter) on each commit
- Commit messages follow the Conventional Commits convention
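
These checks can also be run by hand (a sketch using the standard ruff and pre-commit entry points; the hooks actually executed depend on this project's configuration):

```sh
uv run ruff check .                 # lint the Python files
uv run ruff format .                # format the Python files
uv run pre-commit run --all-files   # run every configured hook at once
```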

## 7. 📚 Complementary documentation