
Generate new sources.csv from PostgreSQL database #779

Open
emmambd opened this issue Oct 25, 2024 · 12 comments

emmambd (Contributor) commented Oct 25, 2024

Describe the problem

As discussed, we want to shift away from the catalogs repo being our source of truth so we can include more of the dynamic info that comes from our data pipeline.

Proposed solution

  • Override location with extracted locations
  • Generate new spreadsheet from postgres database
  • Re-generate the spreadsheet every Tuesday and Friday

Alternatives you've considered

No response

Additional context

No response

emmambd added the enhancement label on Oct 25, 2024
davidgamez (Member) commented:

Tasks:

  • Keep the "legacy" csv file and add a new csv based on the DB information (see the export sketch after this list)
  • Investigate if we can use the same shortened URL for the new sources.csv
  • Re-generate the spreadsheet every Tuesday and Friday
  • Test populate script with DB that has feeds not present in the "legacy" csv
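
To make the first task above concrete, here is a minimal sketch of what the export step could look like, assuming a psycopg2 connection and a placeholder feed table; none of the table or column names below come from the actual schema.

```python
# Sketch only: "feed" and its columns are placeholder names, not the real schema.
import psycopg2

EXPORT_QUERY = """
    COPY (
        SELECT stable_id, data_type, provider, feed_name, producer_url, status
        FROM feed
        ORDER BY stable_id
    ) TO STDOUT WITH (FORMAT csv, HEADER true)
"""

def export_sources_csv(dsn: str, out_path: str = "sources.csv") -> None:
    """Dump the (hypothetical) feed table straight to a CSV file."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(out_path, "w") as f:
        cur.copy_expert(EXPORT_QUERY, f)


if __name__ == "__main__":
    export_sources_csv("postgresql://user:password@localhost:5432/mobilitydb")
```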

jcpitre self-assigned this on Nov 11, 2024
jcpitre (Contributor) commented Nov 12, 2024

For the bit.ly URL, from what I can gather we can change the destination URL if we have a paying account.

I suggest instead putting the new sources.csv (the one generated from the DB) in place of the old one in the mdb-csv bucket, and renaming the old one (the one generated from the JSON files in the catalog) to something like sources-fromCatalog.csv in the same mdb-csv bucket.

This means we will need to change our code to refer to sources-fromCatalog.csv instead of sources.csv, but the advantage is that external users following the bit.ly link will transparently start getting the sources.csv generated from the DB.
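
A minimal sketch of that swap, assuming the google-cloud-storage client and that the bucket is literally named mdb-csv (object names here are just for illustration):

```python
# Sketch: keep the legacy (catalog-generated) file under a new name, then
# publish the DB-generated CSV at the old path so the bit.ly link keeps working.
from google.cloud import storage

def swap_sources_csv(new_csv_local_path: str) -> None:
    client = storage.Client()
    bucket = client.bucket("mdb-csv")  # assumed bucket name

    # 1. Copy the current sources.csv to its "legacy" name.
    legacy = bucket.blob("sources.csv")
    bucket.copy_blob(legacy, bucket, "sources-fromCatalog.csv")

    # 2. Upload the DB-generated file under the original name.
    bucket.blob("sources.csv").upload_from_filename(new_csv_local_path)
```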

jcpitre (Contributor) commented Nov 12, 2024

@emmambd can you give me an idea of the "dynamic info that comes from our data pipeline" that we want to include in the new csv file?

emmambd (Contributor, Author) commented Nov 12, 2024

@jcpitre This approach makes sense to me! Dynamic data mainly includes everything associated with the "latest dataset" that's in the feed response (see the query sketch after this list), so:

  • Bounding box changes
  • Location changes
  • Feature changes
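
For illustration only, the dynamic columns could come from a join against the latest dataset. Every table and column name in this query is an assumption about the schema, not the real thing.

```python
# Hypothetical query: join each feed to its latest dataset to pull the dynamic
# fields (bounding box, extracted locations, features) into the export.
# All table/column names below are assumptions, not the actual schema.
LATEST_DATASET_QUERY = """
    SELECT
        f.stable_id,
        d.bounding_box_min_lat, d.bounding_box_max_lat,
        d.bounding_box_min_lon, d.bounding_box_max_lon,
        string_agg(DISTINCT l.country_code || '-' || l.subdivision_name, '|') AS locations,
        string_agg(DISTINCT feat.name, '|') AS features
    FROM feed f
    JOIN gtfsdataset d ON d.feed_id = f.id AND d.latest = TRUE
    LEFT JOIN dataset_location l ON l.dataset_id = d.id
    LEFT JOIN dataset_feature feat ON feat.dataset_id = d.id
    GROUP BY f.stable_id,
             d.bounding_box_min_lat, d.bounding_box_max_lat,
             d.bounding_box_min_lon, d.bounding_box_max_lon
"""
```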

jcpitre (Contributor) commented Nov 20, 2024

@emmambd There's a location.bounding_box.extracted_on column in the current csv. I don't think that's information we have in the DB. @cka-y can you confirm?

emmambd (Contributor, Author) commented Nov 20, 2024

@cka-y But it is in the API? That's clear - works for me

jcpitre (Contributor) commented Nov 20, 2024

In the API, there's a downloaded_at for the dataset, a validated_at for the validation report, and a created_at for the feed.
I don't see anything in the DB about the date-time of extraction.

Maybe we can use validated_at as the bounding box extraction time?

cka-y (Contributor) commented Nov 20, 2024

The flow is as follows: when a dataset is uploaded to the datasets bucket, it automatically triggers the generation of the validation report and the extraction of the bounding box (and location). These are two separate processes.

@jcpitre is correct that the location extraction is the only process without an associated timestamp. To address this, we could:

  1. Use the validated_at timestamp, as @jcpitre suggested.
  2. Use the downloaded_at timestamp from the dataset upload.
  3. Leave it blank for now and later add a dedicated timestamp for location extraction, which would require a database update and a modification to the existing cloud function (estimated effort: small, <1 day).

One concern with using validated_at is the potential edge case where a bounding box exists without an associated validation report. While I'm not sure if this scenario has occurred, there’s currently no mechanism to prevent it.
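
If option 3 were chosen, the change could be as small as one new column plus one extra statement in the extraction function. A sketch, with illustrative names only:

```python
# Illustrative only: table and column names are assumptions about the schema.
#
# 1) One-time schema change (Postgres):
#      ALTER TABLE dataset_location ADD COLUMN extracted_at TIMESTAMPTZ;
#
# 2) In the cloud function that extracts the bounding box / location,
#    stamp the row once extraction finishes:
from datetime import datetime, timezone

def record_location_extraction(cursor, dataset_id: str) -> None:
    """Record when location/bounding-box extraction completed for a dataset."""
    cursor.execute(
        "UPDATE dataset_location SET extracted_at = %s WHERE dataset_id = %s",
        (datetime.now(timezone.utc), dataset_id),
    )
```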

emmambd (Contributor, Author) commented Dec 12, 2024

@jcpitre Please don't update the old URL for the spreadsheet - please create a new one (like a v2) instead. And then we'll add it to the docs

mil commented Dec 17, 2024

> @jcpitre This approach makes sense to me! Dynamic data mainly includes everything associated with the "latest dataset" that's in the feed response, so
>
>   • Bounding box changes
>   • Location changes
>   • Feature changes

In addition to the dynamic data mentioned by @emmambd above, there could also be:

  • Validation report URL
  • Dataset file size
  • Service date range

Passing these 3 extra fields along in the CSV would potentially be very valuable for end consumers.

emmambd (Contributor, Author) commented Jan 6, 2025

@mil Could you share some more context for how you want to use the validation report URL?

mil commented Jan 10, 2025

> @mil Could you share some more context for how you want to use the validation report URL?

Currently I have an app which pulls GTFS data specified by Mobility Database's CSV (from either the CI bucket mirror or direct URLs). I've had users tell me particular feeds don't work well with the app sometimes.

Having file size and date range metadata would be very helpful for addressing these end-user issues (if they pull a huge feed or an outdated feed, things may not work regardless of my application logic). Also, if the validation report URL were passed along, it would give end users a hard stop: a way to check whether a feed should work in the app in the first place.
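
For context, a consumer-side check could look roughly like the sketch below; all column names are hypothetical until the new CSV format is settled.

```python
# Sketch of the consumer-side check described above; every column name here
# is hypothetical until the new sources.csv format is settled.
import csv
from datetime import date

MAX_FEED_BYTES = 200 * 1024 * 1024  # app-specific threshold, illustrative

def usable_feeds(csv_path: str):
    """Yield (download_url, validation_report_url) for feeds the app can handle."""
    today = date.today().isoformat()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            too_big = int(row.get("latest_dataset.file_size_bytes") or 0) > MAX_FEED_BYTES
            expired = (row.get("latest_dataset.service_end_date") or "9999-12-31") < today
            if not too_big and not expired:
                yield row.get("urls.latest"), row.get("latest_dataset.validation_report_url")
```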
