Feature/stac ingest check (#74)
* Add checks between stac metadata and parquet data upon ingestion

* Update documentation for ingestion process
zacdezgeo authored Oct 2, 2024
1 parent 3a0068b commit ae9f40c
Showing 7 changed files with 362 additions and 140 deletions.
59 changes: 41 additions & 18 deletions README.md
@@ -16,31 +16,54 @@ pip install "git+https://github.com/worldbank/DECAT_Space2Stats.git#subdirectory

- Setup the database:

```
docker-compose up -d
```
- Create a `db.env` file:
```.env
PGHOST=localhost
PGPORT=5439
PGDATABASE=postgis
PGUSER=username
PGPASSWORD=password
PGTABLENAME=space2stats
```
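The connection string passed to the ingestion commands below carries the same values as `db.env`. As a minimal sketch, assuming those variables are already exported in the environment, the URL could be assembled as follows. The helper name `build_connection_string` is hypothetical and is not part of `space2stats-ingest`.

```python
import os
from urllib.parse import quote


def build_connection_string(env=None):
    """Assemble a PostgreSQL URL from the PG* variables listed in db.env.

    Hypothetical helper for illustration only; it is not part of the CLI.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as "@" or ":" stay valid in a URL.
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Same placeholder values as the db.env example above.
example_env = {
    "PGHOST": "localhost",
    "PGPORT": "5439",
    "PGDATABASE": "postgis",
    "PGUSER": "username",
    "PGPASSWORD": "password",
}
print(build_connection_string(example_env))
# prints postgresql://username:password@localhost:5439/postgis
```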
- Load our dataset into the database
- Ingest the dataset into the database:
```
./postgres/download_parquet.sh
./load_to_prod.sh
```
Use the space2stats-ingest CLI to download the Parquet file from S3 and load it into your database. You’ll also need the STAC metadata file to validate the Parquet schema during ingestion.
To download the Parquet file from S3 and load it into the database:
```
poetry run space2stats-ingest download-and-load \
"s3://<bucket>/space2stats.parquet" \
"postgresql://username:password@localhost:5439/postgis" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
```
Alternatively, you can download the Parquet file and load it into the database separately:
- Download the Parquet file:
```
poetry run space2stats-ingest download \
"s3://<bucket>/space2stats.parquet" \
--local-path "local.parquet"
```
- Load the Parquet file into the database:
> You can get started with a subset of data for NYC with `./load_nyc_sample.sh` which requires changing your `db.env` value for `PGTABLENAME` to `space2stats_nyc_sample`.
```
poetry run space2stats-ingest load \
"postgresql://username:password@localhost:5439/postgis" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
```
- Access your data using the Space2stats API! See the [example notebook](notebooks/space2stats_api_demo.ipynb).
- Finally, access your data using the Space2stats API! See the [example notebook](notebooks/space2stats_api_demo.ipynb).
## Usage
22 changes: 19 additions & 3 deletions docs/acceptance/db.md
@@ -13,6 +13,8 @@ The input data is stored in Parquet format on AWS S3, located in the file `space
- `hex_id`
- `{variable_name}_{aggregation_method[sum, mean, etc.]}_{year}`
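To make the naming convention concrete, the following sketch splits a statistic column name into its three parts. It assumes that neither the aggregation method nor the year contains an underscore, so the name can be split from the right. The column name `built_area_sum_2020` and the helper `parse_stat_column` are hypothetical examples, not names taken from the dataset or the codebase.

```python
from typing import NamedTuple


class StatColumn(NamedTuple):
    variable: str
    aggregation: str
    year: int


def parse_stat_column(column: str) -> StatColumn:
    """Split '{variable_name}_{aggregation_method}_{year}' into its parts.

    Splitting from the right lets the variable name itself contain underscores.
    """
    parts = column.rsplit("_", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"column does not follow the naming convention: {column!r}")
    variable, aggregation, year = parts
    return StatColumn(variable, aggregation, int(year))


# Hypothetical column name used purely for illustration.
print(parse_stat_column("built_area_sum_2020"))
# prints StatColumn(variable='built_area', aggregation='sum', year=2020)
```

A column such as `hex_id` does not follow the pattern and raises `ValueError`, so identifier columns need to be handled separately.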

In addition to the Parquet file, a corresponding STAC metadata file is required to ensure that the data structure in the Parquet file matches the metadata specification. The STAC metadata file describes the columns present in the Parquet file and is used to perform schema validation before loading the data into the database.

### Database Setup

You can use a local database for this acceptance test by running:
@@ -40,6 +42,12 @@ PGTABLENAME=space2stats
### CLI Usage:

You can use the CLI tool for data ingestion, which includes validation of the Parquet file against the STAC metadata file to ensure consistency between the data structure and the metadata.

#### Ingestion Process

The ingestion commands now take an additional argument: the path to the STAC metadata file. Before loading, the Parquet schema is validated against this metadata. The validation checks that neither the Parquet file nor the metadata contains a column missing from the other, giving a 1:1 correspondence between them.
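The 1:1 check can be pictured as a set comparison of column names. The sketch below is illustrative only and does not reproduce the project's implementation. It assumes the STAC item lists its columns under `properties["table:columns"]` as objects with a `name` key (the layout used by the STAC table extension), and the function name `compare_columns` is hypothetical.

```python
def compare_columns(parquet_columns, stac_item):
    """Return the column names that appear on only one side.

    Assumes the STAC item stores columns under properties["table:columns"],
    each entry being a dict with a "name" key.
    """
    stac_columns = {col["name"] for col in stac_item["properties"]["table:columns"]}
    parquet_set = set(parquet_columns)
    return {
        "only_in_parquet": sorted(parquet_set - stac_columns),
        "only_in_stac": sorted(stac_columns - parquet_set),
    }


# Hypothetical inputs: one Parquet column is not described by the metadata.
parquet_columns = ["hex_id", "population_sum_2020", "unexpected_column"]
stac_item = {
    "properties": {
        "table:columns": [
            {"name": "hex_id", "type": "string"},
            {"name": "population_sum_2020", "type": "float"},
        ]
    }
}

mismatch = compare_columns(parquet_columns, stac_item)
print(mismatch)
# prints {'only_in_parquet': ['unexpected_column'], 'only_in_stac': []}
```

Ingestion would proceed only when both lists are empty. With `pyarrow` installed, the Parquet column names could be read without loading the data via `pyarrow.parquet.read_schema("local.parquet").names`.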

First, ensure you have the required dependencies installed via Poetry:

```bash
@@ -49,19 +57,27 @@ poetry install
To download the Parquet file from S3 and load it into the database, run the following command:

```bash
poetry run space2stats-ingest download-and-load "s3://yourbucket/space2stats_updated.parquet" "postgresql://postgres:password@localhost:5432/postgis"
poetry run space2stats-ingest download-and-load \
"s3://<bucket>/space2stats.parquet" \
"postgresql://username:password@localhost:5439/postgres" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
```

Alternatively, you can run the `download` and `load` commands separately:

1. **Download the Parquet file**:
```bash
poetry run space2stats-ingest download "s3://yourbucket/space2stats_updated.parquet" --local-path "local.parquet"
poetry run space2stats-ingest download "s3://<bucket>/space2stats.parquet" --local-path "local.parquet"
```

2. **Load the Parquet file into the database**:
```bash
poetry run space2stats-ingest load "postgresql://postgres:password@localhost:5432/postgis" --parquet-file "local.parquet"
poetry run space2stats-ingest load \
"postgresql://username:password@localhost:5439/postgres" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
```

### Database Configuration