
Generate new sources.csv from PostgreSQL database #779

Open
emmambd opened this issue Oct 25, 2024 · 12 comments

emmambd (Contributor) commented Oct 25, 2024

Describe the problem

As discussed, we want to shift away from the catalogs repo being our source of truth so we can include more of the dynamic info that comes from our data pipeline.

Proposed solution

  • Override location with extracted locations
  • Generate new spreadsheet from postgres database
  • Re-generate the spreadsheet every Tuesday and Friday

Alternatives you've considered

No response

Additional context

No response

emmambd added the enhancement label on Oct 25, 2024
davidgamez (Member) commented:

Tasks:

  • Keep the "legacy" csv file and add a new csv based on the DB information (see the export sketch after this list)
  • Investigate if we can use the same shortened URL for the new sources.csv
  • Re-generate the spreadsheet every Tuesday and Friday
  • Test populate script with DB that has feeds not present in the "legacy" csv
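
To make the first task above concrete, here is a minimal sketch of what the export step could look like, assuming a psycopg2 connection and a placeholder feed table; none of the table or column names below come from the actual schema.

```python
# Sketch only: "feed" and its columns are placeholder names, not the real schema.
import psycopg2

EXPORT_QUERY = """
    COPY (
        SELECT stable_id, data_type, provider, feed_name, producer_url, status
        FROM feed
        ORDER BY stable_id
    ) TO STDOUT WITH (FORMAT csv, HEADER true)
"""

def export_sources_csv(dsn: str, out_path: str = "sources.csv") -> None:
    """Dump the (hypothetical) feed table straight to a CSV file."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(out_path, "w") as f:
        cur.copy_expert(EXPORT_QUERY, f)


if __name__ == "__main__":
    export_sources_csv("postgresql://user:password@localhost:5432/mobilitydb")
```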

jcpitre self-assigned this on Nov 11, 2024
jcpitre (Contributor) commented Nov 12, 2024

For the bit.ly URL, from what I can gather we can change the destination URL if we have a paying account.

I suggest instead putting the new sources.csv (the one generated from the DB) in place of the old one in the mdb-csv bucket, and renaming the old one (the one generated from the JSON files in the catalog) to something like sources-fromCatalog.csv in the same mdb-csv bucket.

This means we will need to change our code to refer to sources-fromCatalog.csv instead of sources.csv, but the advantage is that external users following the bit.ly link will transparently start getting the sources.csv generated from the DB.
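
A minimal sketch of that swap, assuming the google-cloud-storage client and that the bucket is literally named mdb-csv (object names here are just for illustration):

```python
# Sketch: keep the legacy (catalog-generated) file under a new name, then
# publish the DB-generated CSV at the old path so the bit.ly link keeps working.
from google.cloud import storage

def swap_sources_csv(new_csv_local_path: str) -> None:
    client = storage.Client()
    bucket = client.bucket("mdb-csv")  # assumed bucket name

    # 1. Copy the current sources.csv to its "legacy" name.
    legacy = bucket.blob("sources.csv")
    bucket.copy_blob(legacy, bucket, "sources-fromCatalog.csv")

    # 2. Upload the DB-generated file under the original name.
    bucket.blob("sources.csv").upload_from_filename(new_csv_local_path)
```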

jcpitre (Contributor) commented Nov 12, 2024

@emmambd can you give me an idea of the "dynamic info that comes from our data pipeline" that we want to include in the new csv file?

emmambd (Contributor, Author) commented Nov 12, 2024

@jcpitre This approach makes sense to me! Dynamic data mainly includes everything associated with the "latest dataset" that's in the feed response (see the query sketch after this list), so:

  • Bounding box changes
  • Location changes
  • Feature changes
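
For illustration only, the dynamic columns could come from a join against the latest dataset. Every table and column name in this query is an assumption about the schema, not the real thing.

```python
# Hypothetical query: join each feed to its latest dataset to pull the dynamic
# fields (bounding box, extracted locations, features) into the export.
# All table/column names below are assumptions, not the actual schema.
LATEST_DATASET_QUERY = """
    SELECT
        f.stable_id,
        d.bounding_box_min_lat, d.bounding_box_max_lat,
        d.bounding_box_min_lon, d.bounding_box_max_lon,
        string_agg(DISTINCT l.country_code || '-' || l.subdivision_name, '|') AS locations,
        string_agg(DISTINCT feat.name, '|') AS features
    FROM feed f
    JOIN gtfsdataset d ON d.feed_id = f.id AND d.latest = TRUE
    LEFT JOIN dataset_location l ON l.dataset_id = d.id
    LEFT JOIN dataset_feature feat ON feat.dataset_id = d.id
    GROUP BY f.stable_id,
             d.bounding_box_min_lat, d.bounding_box_max_lat,
             d.bounding_box_min_lon, d.bounding_box_max_lon
"""
```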

jcpitre (Contributor) commented Nov 20, 2024

@emmambd There's a location.bounding_box.extracted_on column in the current csv. I don't think that's information we have in the DB. @cka-y can you confirm?

emmambd (Contributor, Author) commented Nov 20, 2024

@cka-y But it is in the API? That's clear - works for me

jcpitre (Contributor) commented Nov 20, 2024

In the API, there's a downloaded_at for the dataset, a validated_at for the validation report, and a created_at for the feed.
I don't see anything in the DB about the date-time of extraction.

Maybe we can use validated_at as the bounding box extraction time?

cka-y (Contributor) commented Nov 20, 2024

The flow is as follows: when a dataset is uploaded to the datasets bucket, it automatically triggers the generation of the validation report and the extraction of the bounding box (and location). These are two separate processes.

@jcpitre is correct that the location extraction is the only process without an associated timestamp. To address this, we could:

  1. Use the validated_at timestamp, as @jcpitre suggested.
  2. Use the downloaded_at timestamp from the dataset upload.
  3. Leave it blank for now and later add a dedicated timestamp for location extraction, which would require a database update and a modification to the existing cloud function (estimated effort: small, <1 day).

One concern with using validated_at is the potential edge case where a bounding box exists without an associated validation report. While I'm not sure if this scenario has occurred, there’s currently no mechanism to prevent it.
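
If option 3 were chosen, the change could be as small as one new column plus one extra statement in the extraction function. A sketch, with illustrative names only:

```python
# Illustrative only: table and column names are assumptions about the schema.
#
# 1) One-time schema change (Postgres):
#      ALTER TABLE dataset_location ADD COLUMN extracted_at TIMESTAMPTZ;
#
# 2) In the cloud function that extracts the bounding box / location,
#    stamp the row once extraction finishes:
from datetime import datetime, timezone

def record_location_extraction(cursor, dataset_id: str) -> None:
    """Record when location/bounding-box extraction completed for a dataset."""
    cursor.execute(
        "UPDATE dataset_location SET extracted_at = %s WHERE dataset_id = %s",
        (datetime.now(timezone.utc), dataset_id),
    )
```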

emmambd (Contributor, Author) commented Dec 12, 2024

@jcpitre Please don't update the old URL for the spreadsheet - please create a new one (like a v2) instead. And then we'll add it to the docs

mil commented Dec 17, 2024

> @jcpitre This approach makes sense to me! Dynamic data mainly includes everything associated with the "latest dataset" that's in the feed response, so
>
>   • Bounding box changes
>   • Location changes
>   • Feature changes

In addition to the dynamic data mentioned by @emmambd above, there could also be:

  • Validation report URL
  • Dataset file size
  • Service date range

Passing these 3 extra fields along in the CSV would potentially be very valuable for end consumers.

emmambd (Contributor, Author) commented Jan 6, 2025

@mil Could you share some more context for how you want to use the validation report URL?

mil commented Jan 10, 2025

> @mil Could you share some more context for how you want to use the validation report URL?

Currently I have an app which pulls GTFS data specified by Mobility Database's CSV (from either the CI bucket mirror or direct URLs). I've had users tell me particular feeds don't work well with the app sometimes.

Having file size and date range metadata would be very helpful for addressing these end-user issues (if they pull a huge feed or an outdated feed, things may not work regardless of my application logic). Also, if the validation report URL were passed along, it would give end users a hard stop: a way to check whether a feed should work in the app in the first place.
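
For context, a consumer-side check could look roughly like the sketch below; all column names are hypothetical until the new CSV format is settled.

```python
# Sketch of the consumer-side check described above; every column name here
# is hypothetical until the new sources.csv format is settled.
import csv
from datetime import date

MAX_FEED_BYTES = 200 * 1024 * 1024  # app-specific threshold, illustrative

def usable_feeds(csv_path: str):
    """Yield (download_url, validation_report_url) for feeds the app can handle."""
    today = date.today().isoformat()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            too_big = int(row.get("latest_dataset.file_size_bytes") or 0) > MAX_FEED_BYTES
            expired = (row.get("latest_dataset.service_end_date") or "9999-12-31") < today
            if not too_big and not expired:
                yield row.get("urls.latest"), row.get("latest_dataset.validation_report_url")
```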
