
feat: validate downloaded files as ZIP before processing dataset #501

Merged
merged 5 commits into main from feat/471 on Jun 20, 2024

Conversation

@cka-y (Contributor) commented Jun 19, 2024

Summary:

This update enhances batch processing by validating that the producer URL returns a zipped file before saving to GCP and updating the database. As part of this improvement, I also removed from GCP and the database all datasets and related entities that were not zipped files and not linked to other records. Below are the details of all invalid datasets per environment:

Expected Behavior:

The system should check if the producer URL returns a zipped file. If the URL does not return a zipped file, the dataset should not be stored in GCP or persisted in the database.
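
A minimal sketch of the kind of check this implies, assuming Python and the standard-library zipfile module (the function name below is illustrative, not the actual implementation in batch-process-dataset):

import zipfile

def is_valid_zip(file_path: str) -> bool:
    """Return True only if the downloaded file is a readable ZIP archive."""
    try:
        # is_zipfile looks for the ZIP end-of-central-directory signature;
        # testzip additionally reads every member and verifies its CRC.
        if not zipfile.is_zipfile(file_path):
            return False
        with zipfile.ZipFile(file_path) as archive:
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False

Under this approach, the check would run after the download and before the upload, so an HTML error page or an empty response never reaches the GCP bucket or the database.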

Testing Tips:

  • Adjust the FEED_LIMIT variable in batch-datasets to cover a wide range of datasets (default is 10).
  • Run the function using the following command:
curl -m 30 -X POST https://northamerica-northeast1-mobility-feeds-dev.cloudfunctions.net/batch-datasets-dev \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json"
  • Inspect the logs to ensure that each URL identified as not being a ZIP file is indeed not a valid ZIP file. You can filter the logs using the search string "is not a valid ZIP file", which is logged at ERROR level by the batch-process-dataset cloud function (see the sample gcloud filter after this list).
  • Verify that invalid datasets are correctly logged, reported, and omitted from GCP and the database.
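
For example, the matching log entries can be pulled with the Cloud Logging CLI; the function name, log field, and project below are assumptions and may need to be adjusted to the actual dev deployment:

gcloud logging read \
  'resource.type="cloud_function"
   AND resource.labels.function_name="batch-process-dataset-dev"
   AND severity>=ERROR
   AND textPayload:"is not a valid ZIP file"' \
  --project=mobility-feeds-dev \
  --limit=50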

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". The title must follow the Conventional Commits specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@cka-y (Contributor, Author) commented Jun 19, 2024

Most skipped datasets are not valid ZIP files. However, there are a few cases where accessing the producer_url in a web browser downloads a valid ZIP file, but attempting to replicate the download programmatically returns an HTML file instead. For example, mdb-683 returns:

<html>
<head><title>410 Gone</title></head>
<body>
<center><h1>410 Gone</h1></center>
<hr><center>openresty</center>
</body>
</html>

Here is a list of stable IDs and producer URLs where this issue is known to occur (note that there may be others); a minimal reproduction sketch follows the list:

  • mdb-232 (returns an empty response when accessed programmatically)
  • mdb-919 (returns an HTML file when accessed programmatically)
  • mdb-683 (returns an HTML file when accessed programmatically)
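
A minimal reproduction sketch, assuming Python with requests (the URL below is a placeholder, not one of the affected producer URLs):

import io
import zipfile

import requests

url = "https://example.com/gtfs.zip"  # placeholder for an affected producer_url
response = requests.get(url, timeout=30)

content_type = response.headers.get("Content-Type", "")
looks_like_zip = zipfile.is_zipfile(io.BytesIO(response.content))

# For the feeds listed above, a browser download yields a valid ZIP, but this
# request returns an empty body or an HTML error page (e.g. "410 Gone"),
# so looks_like_zip is False and the dataset is skipped by the new check.
print(response.status_code, content_type, looks_like_zip)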

@cka-y (Contributor, Author) commented Jun 19, 2024

Here is the list of feeds in PROD with no datasets. Note that this list might shrink if we run the batch-datasets function. However, we should not run it before updating the batch-process-dataset function to ignore invalid ZIP files; otherwise, I'll need to restart the process. (cc: @emmambd)

prod-feeds-w-no-dataset.csv

@cka-y (Contributor, Author) commented Jun 19, 2024

To avoid manually deleting files and database entities again, we should consider the following options:

  • Option A: Pause the batch-datasets scheduler in PROD to prevent storing new datasets, some of which might be invalid ZIP files.
  • Option B: Update the batch-process-datasets function in PROD by running the datasets-batch-deployer-prod GitHub Action without releasing (since we are not ready for a full release).

I recommend Option A, depending on how long we need before releasing. However, if the scheduler is paused for too long, we might miss a significant number of historical datasets. A sketch of the corresponding gcloud commands follows this comment. Thoughts? cc: @emmambd @davidgamez
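
For reference, Option A would amount to something like the following; the job name, region, and project ID are assumptions and should be checked against the actual PROD scheduler before running:

# Pause the Cloud Scheduler job that triggers batch-datasets in PROD
gcloud scheduler jobs pause batch-datasets-prod \
  --location=northamerica-northeast1 \
  --project=mobility-feeds-prod

# Resume it once batch-process-dataset ignores invalid ZIP files
gcloud scheduler jobs resume batch-datasets-prod \
  --location=northamerica-northeast1 \
  --project=mobility-feeds-prod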

@cka-y marked this pull request as ready for review June 19, 2024 19:02
@cka-y linked an issue Jun 19, 2024 that may be closed by this pull request
@davidgamez (Member) commented:

> To avoid manually deleting files and database entities again, we should consider the following options: […] I recommend Option A, depending on how long we need before releasing. […]

I prefer Option A. We can connect tomorrow at stand-up to make a decision.

@emmambd (Contributor) commented Jun 20, 2024

Option A worries me significantly if we're waiting a full month. Let's discuss.

@emmambd (Contributor) commented Jun 20, 2024

@cka-y based on your CSV, it looks like there are about 200 feeds that don't have any real value (they don't return a dataset). From a quick glance, I think about half of these are probably legitimately "deprecated" (meaning I can find a replacement feed for them), but there will also be some where we have no idea whether the feed has truly been replaced or is just down.

We might need an additional issue to return some kind of "non-fetchable" message in the API response and on the website for this use case.

@Alessandro100 (Contributor) left a comment:

Looks good and straightforward to understand ✅

@davidgamez (Member) left a comment:

LGTM

@emmambd (Contributor) commented Jun 20, 2024

Created #506 and #505 to address what's out of scope for this PR

@cka-y merged commit 1cabc46 into main on Jun 20, 2024 (2 checks passed)
@cka-y deleted the feat/471 branch on June 20, 2024 at 17:33
Development

Successfully merging this pull request may close these issues.

When a ZIP file is invalid, don't store a dataset
5 participants