feat: validate downloaded files as ZIP before processing dataset #501
Conversation
Most skipped datasets are not valid ZIP files. However, there are a few cases where accessing the producer URL returns an HTML error page instead, for example:
<html>
<head><title>410 Gone</title></head>
<body>
<center><h1>410 Gone</h1></center>
<hr><center>openresty</center>
</body>
</html>
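For context, a ZIP archive starts with a "PK" signature, so a payload like the HTML above can be told apart from a real dataset with a check along these lines. This is only an illustration, not the code used in the batch function, and the example payload is truncated:

```python
# Illustrative check only (not the project's actual code): ZIP archives start with a
# "PK" signature, while these failing URLs return an HTML error page instead.
payload = b"<html>\n<head><title>410 Gone</title></head>\n</html>"  # truncated example body
is_zip = payload[:4] in (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
print(is_zip)  # False, so this download would be skipped
```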
Here is a list of stable IDs and producer URLs where this issue is known to occur (note that there may be others):
Here is the list of feeds in PROD with no datasets. Note that this list might reduce if we run the
To avoid manually deleting files and database entities again, we should consider the following options:
I recommend Option A, depending on how long we need before releasing. However, if the scheduler is paused for too long, we might miss a significant number of historical datasets. Thoughts? cc: @emmambd @davidgamez
I prefer Option A. We can connect tomorrow at stand-up to make a decision.
Option A worries me significantly if we're waiting a full month. Let's discuss.
@cka-y based on your CSV, it looks like there are about 200 feeds that don't have any real value (don't return a dataset). From a quick glance, I think half of these are probably legitimately "deprecated" (meaning I can find a replacement feed for them), but there will also be some where we have no idea if the feed has truly been replaced or if it's just down. We might need an additional issue to return some kind of "non-fetchable" message in the API response and on the website for this use case.
Looks good and straightforward to understand ✅
LGTM
Summary:
This update enhances batch processing by validating that the producer URL returns a valid ZIP file before saving it to GCP and updating the database. As part of this improvement, I also removed from GCP and the database all datasets (and related entities) that were not valid ZIP files and were not linked to other records. Below are the details of all invalid datasets per environment:
Expected Behavior:
The system should check whether the producer URL returns a valid ZIP file. If it does not, the dataset should not be stored in GCP or persisted in the database.
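For illustration, here is a minimal sketch of that gate. It assumes the file is downloaded with requests and uploaded with google-cloud-storage; the function name, bucket, and blob path are illustrative and this is not the actual batch-process-dataset implementation:

```python
# Hedged sketch of the expected flow, not the real batch-process-dataset code.
import io
import logging
import zipfile

import requests
from google.cloud import storage  # assumes google-cloud-storage is installed


def process_dataset(feed_stable_id: str, producer_url: str, bucket_name: str) -> None:
    content = requests.get(producer_url, timeout=60).content

    # Validate the download before touching GCP or the database.
    if not zipfile.is_zipfile(io.BytesIO(content)):
        logging.error("%s: %s is not a valid ZIP file", feed_stable_id, producer_url)
        return  # skip: nothing is uploaded and no dataset row is created

    # Only a valid ZIP reaches this point.
    blob = storage.Client().bucket(bucket_name).blob(f"{feed_stable_id}/latest.zip")
    blob.upload_from_string(content)
    # ...then persist the dataset entity in the database...
```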
Testing Tips:
FEED_LIMIT
variable inbatch-datasets
to cover a wide range of datasets (default is 10).is not a valid ZIP file
, which is logged as anERROR
level message in thebatch-process-dataset
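If you want to check one producer URL by hand before running the full batch, a throwaway script along these lines works. It is illustrative only, not part of the repo, and simply mirrors the same zipfile check described above:

```python
# Throwaway local check (not part of the repo): report whether a single producer URL
# would pass the new ZIP validation or be skipped.
import io
import sys
import zipfile

import requests

url = sys.argv[1]
payload = requests.get(url, timeout=30).content
if zipfile.is_zipfile(io.BytesIO(payload)):
    print("valid ZIP - the dataset would be processed")
else:
    print("is not a valid ZIP file - the dataset would be skipped")
```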
Please make sure these boxes are checked before submitting your pull request - thanks!
- Run ./scripts/api-tests.sh to make sure you didn't break anything