Shut Down & Archive Web Monitoring Projects #170
Part of edgi-govdata-archiving/web-monitoring#170 (shutting down all production services).
Setting the `API_READ_ONLY` environment variable blocks API operations that create/update/delete versions, annotations, etc. Since we are now working toward shutting down the entire system, this is helpful so we can work on creating archives without worrying about the database changing underneath us. See edgi-govdata-archiving/web-monitoring#170.
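For illustration only, here's a minimal sketch of how a flag like `API_READ_ONLY` can gate write requests. This is not the actual web-monitoring-db implementation — just a standalone Rack middleware showing the idea; the class name and error payload are made up.

```ruby
# Hypothetical sketch: reject write requests when API_READ_ONLY is set.
# (The real app may enforce this at the controller level instead.)
require 'rack'

class ReadOnlyGuard
  WRITE_METHODS = %w[POST PUT PATCH DELETE].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if ENV['API_READ_ONLY'] && WRITE_METHODS.include?(env['REQUEST_METHOD'])
      [403, { 'Content-Type' => 'application/json' },
       ['{"errors":[{"title":"This API is in read-only mode."}]}']]
    else
      @app.call(env)
    end
  end
end

# Usage (e.g. in config.ru):
#   use ReadOnlyGuard
#   run MyApp
```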
This is part of a full shutdown of all services. See edgi-govdata-archiving/web-monitoring#170.
Quick updates:
Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:
Note this doesn’t account for how big the files will be after compression (conservative guess is 25%–50% of the bytes listed in the table). I think that makes 3 a good prefix length (large but manageable file sizes, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios (I think we should avoid files > 1 GB). Viable formats:
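For reference, here's a rough sketch of how stats like those above could be computed: bucket each content-addressed file by the first N characters of its hash-based filename and total up file counts and bytes per bucket. The `archive/` directory layout and the assumption that filenames are content hashes are hypothetical, not how our S3 buckets are actually laid out.

```ruby
# Hypothetical sketch: estimate how content-addressed files group together
# when bucketed by the first N characters of their hash-based filenames.
def prefix_stats(paths, prefix_length)
  groups = Hash.new { |h, k| h[k] = { files: 0, bytes: 0 } }
  paths.each do |path|
    key = File.basename(path)[0, prefix_length]  # assumes filename == content hash
    groups[key][:files] += 1
    groups[key][:bytes] += File.size(path)       # uncompressed size
  end
  groups
end

files = Dir.glob('archive/**/*').select { |p| File.file?(p) }  # hypothetical layout
(1..4).each do |n|
  stats = prefix_stats(files, n)
  largest = stats.values.map { |g| g[:bytes] }.max
  puts "prefix length #{n}: #{stats.size} groups, largest group ~#{largest} bytes"
end
```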
Work in progress! This adds a rake command to export the contents of the DB into a SQLite file for public archiving. It's *mostly* a pretty straightforward copy of every table/row, but we skip tables that are irrelevant for a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns with user data, and do some basic conversions. Part of edgi-govdata-archiving/web-monitoring#170
Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
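For anyone curious, here's a rough sketch of that export approach — copy a whitelist of public tables from Postgres into a standalone SQLite file. The table names, connection settings, and output filename below are placeholders rather than what the actual rake task does, and the real task also drops user-data columns and handles type conversions.

```ruby
# Hypothetical sketch of the SQLite export: copy a whitelist of tables from
# the production Postgres DB into a standalone SQLite file, skipping
# administrative tables (users, GoodJob, imports, etc.).
require 'pg'
require 'sqlite3'

PUBLIC_TABLES = %w[pages versions annotations changes].freeze  # assumed names

pg = PG.connect(ENV['DATABASE_URL'])
sqlite = SQLite3::Database.new('web-monitoring-archive.sqlite3')

PUBLIC_TABLES.each do |table|
  result = pg.exec("SELECT * FROM #{table}")
  columns = result.fields
  # Untyped columns for simplicity; the real export would map Postgres types.
  sqlite.execute("CREATE TABLE #{table} (#{columns.join(', ')})")
  placeholders = (['?'] * columns.length).join(', ')
  sqlite.transaction do
    result.each_row do |row|
      sqlite.execute("INSERT INTO #{table} VALUES (#{placeholders})", row)
    end
  end
end
```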
This project is technically deprecated, but I’m doing some work here to support final shutdown and archival of data (edgi-govdata-archiving/web-monitoring#170). The main goal here is to import old annotations (different schema than the import script was built to support) and some much newer ones that were never brought into the DB proper. I want them in the DB so we can export a nice SQLite archive that's easy for people to dig through, as opposed to collating data from a variety of sources.
In #168, we ramped down to barebones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut things down entirely.
This does not apply to two subprojects that are actively used outside EDGI:
To Do:
Stop the daily IA import cron job.
Stop the daily IA healthcheck cron job (that checks whether our capturer over at IA is still running and capturing the URLs we care about) since it is no longer relevant.
Make DB API read-only, shut down import worker.
Investigate methods for archiving existing data. We have metadata about pages & versions (archived snapshots of a URL) in a Postgres database, raw response bodies in S3, and analyst reviews of changes in Google Sheets (not sure if we want to archive these or not).
Archive the data somewhere.
Replace https://monitoring.envirodatagov.org/ and https://api.monitoring.envirodatagov.org/ with a tombstone page describing the project and its current status, where to find archives if publicly available, etc.
Shut down all running services and resources in AWS.
Clean up dangling, irrelevant issues and PRs in all repos. PRs should generally be closed. I like to keep open any issues that someone forking the project might want to address, but close others that would not be relevant in that context.
Update maintenance status notices if needed on repo READMEs.
Archive all relevant repos.