- 1. Project description
- 2. Prerequisites
- 3. Quickstart
- 4. Run
- 5. Internal Architecture
- 6. Code Quality and Formatting
- 7. Complementary documentation
This project aims to build a local database for retrieving NBA data through SQL queries. It consists of two main parts:
- Scraping: To get raw data
- Data engineering: To manipulate the data using dbt & duckdb
Data warehouse documentation: link
The project uses uv (v0.5.10) to manage the Python version and dependencies.
To set up and use the project locally, execute the following steps:
- Install uv v0.5.10 (see doc): curl -LsSf https://astral.sh/uv/0.5.10/install.sh | sh
- Install virtual environment: uv sync
- Set up pre-commit: uv run pre-commit install -t commit-msg -t pre-commit
The scraping step does not need to be run again, as the data has already been extracted. To regenerate it anyway:
cd ./scraping
- Generate game_schedule.csv: uv run python get_games_schedule.py
- Generate game_boxscore.csv: uv run python get_games_boxscore.py
The generated data is then transferred to the sources of the dbt project:
cp ./scraping/data/*.parquet ./transform/nba_dwh/local_source/
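For platforms without `cp`, the same transfer can be sketched in Python. This is a minimal illustration, not part of the project: the function name `copy_parquet_files` is hypothetical, and the two paths in the usage comment are simply the ones from the command above, relative to the repository root.

```python
import shutil
from pathlib import Path


def copy_parquet_files(source_dir: Path, target_dir: Path) -> list[Path]:
    """Copy every *.parquet file from source_dir into target_dir."""
    # Create the destination folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for parquet_file in sorted(source_dir.glob("*.parquet")):
        # copy2 also preserves file metadata such as the modification time.
        destination = target_dir / parquet_file.name
        shutil.copy2(parquet_file, destination)
        copied.append(destination)
    return copied


# Usage sketch (paths taken from the cp command above):
# copy_parquet_files(
#     Path("./scraping/data"),
#     Path("./transform/nba_dwh/local_source"),
# )
```

Files with other extensions in the source folder are left untouched, which matches the `*.parquet` glob of the shell command.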
The following steps create the local DuckDB database using dbt:
cd ./transform/nba_dwh
- Install dbt dependencies: uv run dbt deps
- Run transformations: uv run dbt run
- Test pipeline: uv run dbt test
- Generate doc: uv run dbt docs generate
- Launch doc: uv run dbt docs serve
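The dbt steps above can also be chained from a small Python script that stops at the first failing command. This is only a sketch: `DBT_STEPS` and `run_steps` are hypothetical names, the commands are copied from the list above, and `dbt docs serve` is deliberately left out because it starts a long-running web server.

```python
import subprocess
from pathlib import Path

# Commands copied from the steps above, in the same order.
DBT_STEPS = [
    ["uv", "run", "dbt", "deps"],
    ["uv", "run", "dbt", "run"],
    ["uv", "run", "dbt", "test"],
    ["uv", "run", "dbt", "docs", "generate"],
]


def run_steps(steps, project_dir: Path, runner=subprocess.run) -> int:
    """Run each command inside project_dir and return how many completed.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so later steps are skipped after a failure.
    """
    completed = 0
    for command in steps:
        print("$", " ".join(command))
        runner(command, cwd=project_dir, check=True)
        completed += 1
    return completed


# Usage sketch, run from the repository root:
# run_steps(DBT_STEPS, Path("./transform/nba_dwh"))
```

The `runner` parameter exists only so the sequence can be exercised without invoking dbt, for example in a unit test.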
Once the database is created:
- Open the local db: uv run duckcli ./nba_dwh.duckdb
- Request data:
-- Career statistics of Rajon Rondo
select p.player_name, s.years, ps.nb_games, ps.avg_points, ps.avg_assists
from player_season ps
inner join player p on p.id = ps.player_id
inner join season s on s.id = ps.season_id
where p.player_name like 'Rajon Rondo'
order by s.years
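The same query can be issued from Python instead of duckcli. The sketch below is an illustration under assumptions: `career_stats` is a hypothetical helper, it replaces `like` with `=` since the pattern has no wildcard, and the usage comment assumes the `duckdb` Python package is importable in the uv environment. Table and column names are the ones used in the query above.

```python
CAREER_STATS_SQL = """
select p.player_name, s.years, ps.nb_games, ps.avg_points, ps.avg_assists
from player_season ps
inner join player p on p.id = ps.player_id
inner join season s on s.id = ps.season_id
where p.player_name = ?
order by s.years
"""


def career_stats(connection, player_name: str) -> list:
    """Return one row per season for the given player, ordered by season."""
    # The "?" placeholder lets the driver handle quoting of the player name.
    return connection.execute(CAREER_STATS_SQL, [player_name]).fetchall()


# Usage sketch (assumes the duckdb package and the database created above):
# import duckdb
# con = duckdb.connect("./nba_dwh.duckdb", read_only=True)
# for row in career_stats(con, "Rajon Rondo"):
#     print(row)
# con.close()
```

Passing the player name as a parameter rather than formatting it into the SQL string avoids quoting problems with names that contain apostrophes.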
- Folder /scraping: Contains scripts to generate the raw data
- Folder /transform: Contains dbt project to generate the database
- The Python files are linted and formatted using ruff; see the configuration in pyproject.toml
- The dbt SQL model files are formatted using sqlfmt
- A pre-commit configuration is available to trigger quality checks (e.g. linter)
- Commit messages follow the Conventional Commits convention
- DBT
- DuckDB
- DBT-DuckDB adapter
- An analysis based on this data, leveraging Bayesian statistics, is available here