Shut Down & Archive Web Monitoring Projects #170

Open · 5 of 24 tasks
Mr0grog opened this issue Jul 17, 2023 · 3 comments

Mr0grog commented Jul 17, 2023

In #168, we ramped down to barebones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut down things entirely.

This does not apply to two subprojects that are actively used outside EDGI:

  1. wayback
  2. web-monitoring-diff

To Do:

  • Stop the daily IA import cron job.

  • Stop the daily IA healthcheck cron job (that checks whether our capturer over at IA is still running and capturing the URLs we care about) since it is no longer relevant.

  • Make the DB API read-only, shut down the import worker.

  • Investigate methods for archiving existing data. We have metadata about pages & versions (archived snapshots of a URL) in a Postgres database, raw response bodies in S3, and analyst reviews of changes in Google Sheets (not sure if we want to archive these or not).

    • @gretchengehrke and I are talking with Internet Archive folks about ways to store things there, if possible/relevant.
    • Alternatively, I can look into gzipping or brotli-encoding everything in S3 (~963.1 GB); see the compression-estimate sketch after this to-do list.
    • We could also delete everything in S3 that is also available from the Internet Archive Wayback Machine.
    • Everything is stored as individual hash-addressed files. We may want to combine them into larger indexable blocks (maybe using some hash database format or something), since most files are HTML and relatively small (especially after compression, see above).
    • ✅ / ❌ Do we want to save diffs from Versionista? Our archived data from them is not just response bodies, but also textual and HTML diffs, which probably aren’t as important. (Update: yes if we are just leaving things in the S3 buckets, no if not.)
    • ❓ Do we want to archive and save analyst sheets or the important changes sheets? If so, look into publishing them as CSVs or as SQLite. (Update: Not high priority, but would be nice, especially if we are putting things in IA.)
    • ❌ Do we want to save import requests? They contain the raw metadata that was imported, not just the current state of the DB. I think probably no, but worth considering. (OTOH, I’m pretty certain we don’t want to save import warning/error logs.) (Update: NO.)
  • Archive the data somewhere.

    • If this is in a public space, get it done before replacing the UI & API with a tombstone page (so we can link it).
    • Otherwise, this will just be physical hard drives in people's possession.
  • Replace https://monitoring.envirodatagov.org/ and https://api.monitoring.envirodatagov.org/ with a tombstone page describing the project and its current status, where to find archives if publicly available, etc.

    • This will probably be GitHub Pages (maybe maintained in this repo).
    • @gretchengehrke is working on copy for this.
  • Shut down all running services and resources in AWS.

  • Clean up dangling, irrelevant issues and PRs in all repos. PRs should generally be closed. I like to keep open any issues that someone forking the project might want to address, but close others that would not be relevant in that context.

  • Update maintenance status notices if needed on repo READMEs.

    • web-monitoring
    • web-monitoring-ui
    • web-monitoring-db
    • web-monitoring-processing
    • web-monitoring-ops
    • web-monitoring-versionista-scraper
  • Archive all relevant repos.

    • web-monitoring
    • web-monitoring-ui
    • web-monitoring-db
    • web-monitoring-processing
    • web-monitoring-ops
    • web-monitoring-versionista-scraper
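
For the S3 compression question above, here's a minimal sketch of how we might estimate a typical gzip ratio by sampling the bucket (the bucket name is a placeholder, and it assumes boto3 with working AWS credentials):

```python
import gzip

import boto3  # assumes AWS credentials are already configured locally

s3 = boto3.client('s3')
BUCKET = 'example-web-monitoring-archive'  # placeholder, not the real bucket name

total_raw = 0
total_gzipped = 0
# Sample the first 1,000 objects to get a rough feel for the compression ratio.
listing = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1000)
for item in listing.get('Contents', []):
    body = s3.get_object(Bucket=BUCKET, Key=item['Key'])['Body'].read()
    total_raw += len(body)
    total_gzipped += len(gzip.compress(body))

print(f'gzip size on sample: {total_gzipped / total_raw:.0%} of raw')
```
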
Mr0grog self-assigned this Jul 17, 2023
Mr0grog pinned this issue Jul 17, 2023
Mr0grog changed the title from "Shut down & archive Web Monitoring projects" to "Shut Down & Archive Web Monitoring Projects" Jul 17, 2023
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue Jul 17, 2023
Part of edgi-govdata-archiving/web-monitoring#170 (shutting down all production services).
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Jul 17, 2023
Setting the `API_READ_ONLY` environment variable blocks API operations that create/update/delete versions, annotations, etc. Since we are now working towards shutting down the entire system, this is helpful so we can work on creating archives without worrying about the database changing underneath us. See edgi-govdata-archiving/web-monitoring#170.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue Jul 24, 2023
This is part of a full shutdown of all services. See edgi-govdata-archiving/web-monitoring#170.
Mr0grog commented Aug 3, 2023

Quick updates:

  • Still waiting for some feedback from IA folks about archiving there: what stuff they’ll accept, what formats, etc.
  • If we change the format of data in S3 or remove those buckets, we'll need to update the links in the public Enviro Fed Web Tracker Google Sheet.
  • Important changes sheets and analyst sheets would be nice, but the above “Enviro Fed Web Tracker” sheet is already something of a public version of the important changes sheet, so this is already kind of covered.
    • These are also somewhat well preserved in Google Drive, so maybe that’s good enough.
    • If we wind up able to store this kind of stuff in IA, that makes this more attractive.
  • Not worth synthesizing WARCs for content originally sourced from IA/Wayback Machine.
  • Yes worth synthesizing WARCs for content originally sourced from Versionista.
    • Note: there are some version records in the database without archived response bodies from Versionista (these are from the very first stages of the project, when it was only meant to be a queryable index into Versionista, not an archive or backup).
    • IA folks suggest warcio (Python) is the best tool for writing WARCs.
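
A rough sketch of what writing one of those Versionista-sourced captures with warcio might look like (the URL, timestamp, and body below are stand-ins for values we'd pull from the DB and S3):

```python
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

# Stand-in values; in practice these come from our version records and S3 objects.
capture_url = 'https://www.example.gov/page.html'
capture_time = '2017-03-01T12:00:00Z'
body = b'<html>...</html>'

with open('versionista-captures.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record(capture_url, 'response',
                                       payload=BytesIO(body),
                                       http_headers=http_headers,
                                       warc_headers_dict={'WARC-Date': capture_time})
    writer.write_record(record)
```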

@Mr0grog
Copy link
Member Author

Mr0grog commented Aug 3, 2023

Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:

| prefix_length | groups | count min | count avg | count max | bytes min | bytes avg | bytes max |
|---|---|---|---|---|---|---|---|
| 2 | 256 | 52,316 | 52,816 | 53,399 | 3,409,539.55 kB | 3,484,415.56 kB | 3,629,627.39 kB |
| 3 | 4,096 | 3,102 | 3,301 | 3,487 | 198,391.51 kB | 217,775.97 kB | 404,263.71 kB |
| 4 | 65,536 | 142 | 206 | 267 | 8,900.81 kB | 13,611.00 kB | 192,541.24 kB |
| 5 | 1,048,576 | 1 | 13 | 34 | 0.92 kB | 850.69 kB | 178,881.65 kB |

Note this doesn’t account for how big the files will be after compression (a conservative guess is 25–50% of the bytes listed in the table).

I think that makes 3 a good prefix length (files of a large but manageable size, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios (I think we should avoid files > 1 GB).

Viable formats:

  1. Zip. Widely supported, straightforward, and supports random access (unlike .tar.gz). Good-ish compression.
  2. SQLite Archive. Not as widely supported, but SQLite databases in general are (and this is just a particular database structure). Supports not just random access but all manner of fancy querying (it could feasibly work with Datasette + the datasette-media plugin). Definitely more complex than zips, though.
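
As a rough sketch of option 1, packing the hash-addressed files (assumed here to be mirrored locally) into one zip per 3-character prefix could look something like this:

```python
import zipfile
from collections import defaultdict
from pathlib import Path

SOURCE = Path('s3-mirror')   # assumption: a local mirror of the hash-addressed files
DEST = Path('archives')
DEST.mkdir(exist_ok=True)

# Group files by the first 3 characters of their hash-based names.
groups = defaultdict(list)
for path in SOURCE.iterdir():
    groups[path.name[:3]].append(path)

# Write one deflate-compressed zip per prefix, e.g. "3fa" -> archives/3fa.zip.
for prefix, files in groups.items():
    with zipfile.ZipFile(DEST / f'{prefix}.zip', 'w',
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
```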

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Aug 10, 2023
Work in progress! This adds a rake command to export the contents of the DB into a SQLite file for public archiving. It's *mostly* a pretty straightforward copy of every table/row, but we skip tables that are irrelevant for a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns with user data, and do some basic conversions.

Part of edgi-govdata-archiving/web-monitoring#170
Mr0grog commented Aug 10, 2023

Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
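
Once that export exists, digging into it should just be plain SQLite. For example, something like this (the table and column names here are guesses at the exported schema, not the final thing):

```python
import sqlite3

# Table/column names are assumptions about the exported schema, not a spec.
db = sqlite3.connect('web-monitoring-archive.sqlite3')
db.row_factory = sqlite3.Row

query = '''
    SELECT pages.url, COUNT(versions.uuid) AS version_count
    FROM pages JOIN versions ON versions.page_uuid = pages.uuid
    GROUP BY pages.url
    ORDER BY version_count DESC
    LIMIT 10
'''
for row in db.execute(query):
    print(row['url'], row['version_count'])
```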

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Aug 18, 2023
This project is technically deprecated, but I’m doing some work here to support final shutdown and archival of data (edgi-govdata-archiving/web-monitoring#170).

The main goal here is to import old annotations (different schema than the import script was built to support) and some much newer ones that were never brought into the DB proper. I want them in the DB so we can export a nice SQLite archive that's easy for people to dig through, as opposed to collating data from a variety of sources.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Jan 3, 2025
Work in progress! This adds a rake command to export the contents of the DB into a SQLite file for public archiving. It's *mostly* a pretty straightforward copy of every table/row, but we skip tables that are irrelevant for a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns with user data, and do some basic conversions.

Part of edgi-govdata-archiving/web-monitoring#170