Shut Down & Archive Web Monitoring Projects #170

Open · 5 of 24 tasks
Mr0grog opened this issue Jul 17, 2023 · 3 comments

Mr0grog commented Jul 17, 2023

In #168, we ramped down to barebones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut down things entirely.

This does not apply to two subprojects that are actively used outside EDGI:

  1. wayback
  2. web-monitoring-diff

To Do:

  • Stop the daily IA import cron job.

  • Stop the daily IA healthcheck cron job (that checks whether our capturer over at IA is still running and capturing the URLs we care about) since it is no longer relevant.

  • Make the DB API read-only, shut down the import worker.

  • Investigate methods for archiving existing data. We have metadata about pages & versions (archived snapshots of a URL) in a Postgres database, raw response bodies in S3, and analyst reviews of changes in Google Sheets (not sure if we want to archive these or not).

    • @gretchengehrke and I are talking with Internet Archive folks about ways to store things there, if possible/relevant.
    • Alternatively, I can look into gzipping or brotli-encoding everything in S3 (~963.1 GB); see the compression-estimate sketch after this to-do list.
    • We could also delete everything in S3 that is also available from the Internet Archive Wayback Machine.
    • Everything is stored as individual hash-addressed files. We may want to combine them into larger indexable blocks (maybe using some hash database format or something), since most files are HTML and relatively small (especially after compression, see above).
    • ✅ / ❌ Do we want to save diffs from Versionista? Our archived data from them is not just response bodies, but also textual and HTML diffs, which probably aren’t as important. (Update: yes if we are just leaving things in the S3 buckets, no if not.)
    • ❓ Do we want to archive and save analyst sheets or the important changes sheets? If so, look into publishing them as CSVs or as SQLite. (Update: Not high priority, but would be nice, especially if we are putting things in IA.)
    • ❌ Do we want to save import requests? They contain the raw metadata that was imported, not just the current state of the DB. I think probably no, but worth considering. (OTOH, I’m pretty certain we don’t want to save import warning/error logs.) (Update: NO.)
  • Archive the data somewhere.

    • If this is in a public space, get it done before replacing the UI & API with a tombstone page (so we can link it).
    • Otherwise, this will just be physical hard drives in people's possession.
  • Replace https://monitoring.envirodatagov.org/ and https://api.monitoring.envirodatagov.org/ with a tombstone page describing the project and its current status, where to find archives if publicly available, etc.

    • This will probably be GitHub Pages (maybe maintained in this repo).
    • @gretchengehrke is working on copy for this.
  • Shut down all running services and resources in AWS.

  • Clean up dangling, irrelevant issues and PRs in all repos. PRs should generally be closed. I like to keep open any issues that someone forking the project might want to address, but close others that would not be relevant in that context.

  • Update maintenance status notices if needed on repo READMEs.

    • web-monitoring
    • web-monitoring-ui
    • web-monitoring-db
    • web-monitoring-processing
    • web-monitoring-ops
    • web-monitoring-versionista-scraper
  • Archive all relevant repos.

    • web-monitoring
    • web-monitoring-ui
    • web-monitoring-db
    • web-monitoring-processing
    • web-monitoring-ops
    • web-monitoring-versionista-scraper
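
For the S3 compression question above, here's a minimal sketch of how we might estimate a typical gzip ratio by sampling the bucket (the bucket name is a placeholder, and it assumes boto3 with working AWS credentials):

```python
import gzip

import boto3  # assumes AWS credentials are already configured locally

s3 = boto3.client('s3')
BUCKET = 'example-web-monitoring-archive'  # placeholder, not the real bucket name

total_raw = 0
total_gzipped = 0
# Sample the first 1,000 objects to get a rough feel for the compression ratio.
listing = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1000)
for item in listing.get('Contents', []):
    body = s3.get_object(Bucket=BUCKET, Key=item['Key'])['Body'].read()
    total_raw += len(body)
    total_gzipped += len(gzip.compress(body))

print(f'gzip size on sample: {total_gzipped / total_raw:.0%} of raw')
```
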
Mr0grog self-assigned this Jul 17, 2023
Mr0grog pinned this issue Jul 17, 2023
Mr0grog changed the title from "Shut down & archive Web Monitoring projects" to "Shut Down & Archive Web Monitoring Projects" Jul 17, 2023
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue Jul 17, 2023
Part of edgi-govdata-archiving/web-monitoring#170 (shutting down all production services).
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Jul 17, 2023
Setting the `API_READ_ONLY` environment variable blocks API operations that create/update/delete versions, annotations, etc. Since we are now working towards shutting down the entire system, this is helpful so we can work on creating archives without worrying about the database changing underneath us. See edgi-govdata-archiving/web-monitoring#170.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue Jul 24, 2023
This is part of a full shutdown of all services. See edgi-govdata-archiving/web-monitoring#170.
Mr0grog commented Aug 3, 2023

Quick updates:

  • Still waiting for some feedback from IA folks about archiving there: what stuff they’ll accept, what formats, etc.
  • If we change the format of data in S3 or remove those buckets, we'll need to update the links in the public Enviro Fed Web Tracker Google Sheet.
  • Important changes sheets and analyst sheets would be nice, but the above “Enviro Fed Web Tracker” sheet is already something of a public version of the important changes sheet, so this is already kind of covered.
    • These are also somewhat well preserved in Google Drive, so maybe that’s good enough.
    • If we wind up able to store this kind of stuff in IA, that makes this more attractive.
  • Not worth synthesizing WARCs for content originally sourced from IA/Wayback Machine.
  • Yes worth synthesizing WARCs for content originally sourced from Versionista.
    • Note: there are some version records in the database without archived response bodies from Versionista (these are from the very first stages of the project, when it was only meant to be a queryable index into Versionista, not an archive or backup).
    • IA folks suggest warcio (Python) is the best tool for writing WARCs.
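
A rough sketch of what writing one of those Versionista-sourced captures with warcio might look like (the URL, timestamp, and body below are stand-ins for values we'd pull from the DB and S3):

```python
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

# Stand-in values; in practice these come from our version records and S3 objects.
capture_url = 'https://www.example.gov/page.html'
capture_time = '2017-03-01T12:00:00Z'
body = b'<html>...</html>'

with open('versionista-captures.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record(capture_url, 'response',
                                       payload=BytesIO(body),
                                       http_headers=http_headers,
                                       warc_headers_dict={'WARC-Date': capture_time})
    writer.write_record(record)
```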

@Mr0grog
Copy link
Member Author

Mr0grog commented Aug 3, 2023

Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:

| prefix_length | groups | count min | count avg | count max | bytes min | bytes avg | bytes max |
|---|---|---|---|---|---|---|---|
| 2 | 256 | 52,316 | 52,816 | 53,399 | 3,409,539.55 kB | 3,484,415.56 kB | 3,629,627.39 kB |
| 3 | 4,096 | 3,102 | 3,301 | 3,487 | 198,391.51 kB | 217,775.97 kB | 404,263.71 kB |
| 4 | 65,536 | 142 | 206 | 267 | 8,900.81 kB | 13,611.00 kB | 192,541.24 kB |
| 5 | 1,048,576 | 1 | 13 | 34 | 0.92 kB | 850.69 kB | 178,881.65 kB |

Note this doesn’t account for how big the files will be after compression (a conservative guess is 25–50% of the bytes listed in the table).

I think that makes 3 a good prefix length (files of a large but manageable size, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios (I think we should avoid files > 1 GB).

Viable formats:

  1. Zip. Widely supported, straightforward, and supports random access (unlike .tar.gz). Good-ish compression.
  2. SQLite Archive. Not as widely supported, but SQLite databases in general are (and this is just a particular database structure). Supports not just random access but all manner of fancy querying (it could feasibly work with Datasette + the datasette-media plugin). Definitely more complex than zips, though.
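
As a rough sketch of option 1, packing the hash-addressed files (assumed here to be mirrored locally) into one zip per 3-character prefix could look something like this:

```python
import zipfile
from collections import defaultdict
from pathlib import Path

SOURCE = Path('s3-mirror')   # assumption: a local mirror of the hash-addressed files
DEST = Path('archives')
DEST.mkdir(exist_ok=True)

# Group files by the first 3 characters of their hash-based names.
groups = defaultdict(list)
for path in SOURCE.iterdir():
    groups[path.name[:3]].append(path)

# Write one deflate-compressed zip per prefix, e.g. "3fa" -> archives/3fa.zip.
for prefix, files in groups.items():
    with zipfile.ZipFile(DEST / f'{prefix}.zip', 'w',
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
```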

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Aug 10, 2023
Work in progress! This adds a rake command to export the contents of the DB into a SQLite file for public archiving. It's *mostly* a pretty straightforward copy of every table/row, but we skip tables that are irrelevant for a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns with user data, and do some basic conversions.

Part of edgi-govdata-archiving/web-monitoring#170
Mr0grog commented Aug 10, 2023

Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
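
Once that export exists, digging into it should just be plain SQLite. For example, something like this (the table and column names here are guesses at the exported schema, not the final thing):

```python
import sqlite3

# Table/column names are assumptions about the exported schema, not a spec.
db = sqlite3.connect('web-monitoring-archive.sqlite3')
db.row_factory = sqlite3.Row

query = '''
    SELECT pages.url, COUNT(versions.uuid) AS version_count
    FROM pages JOIN versions ON versions.page_uuid = pages.uuid
    GROUP BY pages.url
    ORDER BY version_count DESC
    LIMIT 10
'''
for row in db.execute(query):
    print(row['url'], row['version_count'])
```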

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Aug 18, 2023
This project is technically deprecated, but I’m doing some work here to support final shutdown and archival of data (edgi-govdata-archiving/web-monitoring#170).

The main goal here is to import old annotations (different schema than the import script was built to support) and some much newer ones that were never brought into the DB proper. I want them in the DB so we can export a nice SQLite archive that's easy for people to dig through, as opposed to collating data from a variety of sources.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue Jan 3, 2025
Work in progress! This adds a rake command to export the contents of the DB into a SQLite file for public archiving. It's *mostly* a pretty straightforward copy of every table/row, but we skip tables that are irrelevant for a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns with user data, and do some basic conversions.

Part of edgi-govdata-archiving/web-monitoring#170