Put this project to rest #168
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-versionista-scraper that referenced this issue on Jan 23, 2023:
This project has effectively been in pure maintenance mode for a while, with no known active users. It no longer makes sense to keep accepting and merging regular dependency updates, so this turns off Dependabot version updates (we still have Dependabot *security* updates). See also edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-versionista-scraper that referenced this issue on Jan 23, 2023:
This project has effectively been in pure maintenance mode for a while, with no known active users. It no longer makes sense to keep accepting and merging regular dependency updates, so this turns off Dependabot version updates (we still have Dependabot *security* updates). See also edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue on Feb 15, 2023:
This project is moving into maintenance mode; it does not make sense to continue making regular, non-security-related dependency updates. See edgi-govdata-archiving/web-monitoring#168 for more.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue on Feb 16, 2023:
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes with everything else. This also cleans up the docs around jobs. Work not visible here: created a new IAM account for jobs that can write to relevant S3 buckets, and added the ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849) since we have no persistent storage in Kubernetes. Why do this now? See edgi-govdata-archiving/web-monitoring#168 and edgi-govdata-archiving/web-monitoring-processing#757.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue on Feb 16, 2023:
As part of putting this project to rest (edgi-govdata-archiving/web-monitoring#168), I put the production API behind CloudFront & WAF. This adds documentation for the current configuration. Fixes #42.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue on Feb 16, 2023:
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes with everything else. This also cleans up the docs around jobs. Work not visible here: created a new IAM account for jobs that can write to relevant S3 buckets, and added the ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849) since we have no persistent storage in Kubernetes. Why do this now? See edgi-govdata-archiving/web-monitoring#168 and edgi-govdata-archiving/web-monitoring-processing#757.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this issue on Feb 17, 2023:
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes with everything else. This also cleans up the docs around jobs. Why do this now? See edgi-govdata-archiving/web-monitoring#168 and edgi-govdata-archiving/web-monitoring-processing#757. Work not visible here: created a new IAM account for jobs that can write to relevant S3 buckets, and added the ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849) since we have no persistent storage in Kubernetes.
Mr0grog added a commit that referenced this issue on Mar 9, 2023:
This is part of #168. EDGI is no longer making active use of the tools here, and they never became generalized to the point where they aren’t a huge amount of effort to deploy and maintain by other organizations or individuals (for example, you need a close partnership with the Internet Archive or another organization that crawls/scrapes/archives the monitored URLs). The goal here is to make the status of things clear and provide some useful resources for anybody looking at this project.

Mr0grog added a commit that referenced this issue on Mar 13, 2023:
This is part of #168. EDGI is no longer making active use of the tools here, and they never became generalized to the point where they aren’t a huge amount of effort to deploy and maintain by other organizations or individuals (for example, you need a close partnership with the Internet Archive or another organization that crawls/scrapes/archives the monitored URLs). The goal here is to make the status of things clear and provide some useful resources for anybody looking at this project.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-versionista-scraper that referenced this issue on Mar 22, 2023.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-db that referenced this issue on Jul 17, 2023:
Note the non-maintained status of this project. Part of edgi-govdata-archiving/web-monitoring#168.
It is time to put this project more-or-less to rest. EDGI no longer makes active use of most of the web monitoring in this and the associated web-monitoring-xyz repos, although the web governance team (the team was renamed) currently wants it to keep running in a minimal capacity. Right now, I feel a lot of responsibility to keep a close watch over the running servers and tools, but that needs to stop.
This does not apply to two subprojects that are actively used outside EDGI:
Anyway, things that probably need to get done here in order for me to largely step away:
- ❌ Stop monitoring as many currently active URLs as possible. As of December 2022, we cut the number of URLs in half, and I am going to look through and see what else might make sense to turn off. (2023-01-24: Additional work here is probably not worth spending effort on. I’m going to leave it be.)
- Fix issues with bad page titles caused by responses that are errors: Page titles should not be updated by versions that are errors (web-monitoring-db#751; PR: Only update the page title for non-error versions, web-monitoring-db#1061)
- Do not require auth for reading the API. One thing auth is currently doing for us is preventing broader abuse of the API, so there are some sizable sub-issues here:
  - Use efficient range-based (instead of offset-based) pagination for versions (where this is a huge problem) and optionally other models. This also means limiting the options for the `sort` parameter. (Rethink Pagination, web-monitoring-db#579; a keyset-pagination sketch appears after this list.)
  - ❌ Maybe denormalize important fields we currently have to do complex and expensive joins for: Denormalize `Page#latest` and `Page#earliest` (web-monitoring-db#858). (Otherwise remove those querying features entirely; see below.) (2023-02-06: This would be nice, but isn’t really necessary.)
  - Remove any API parameters that can cause expensive table scanning (or require auth to use them). (Audit API for parameters that can heavily de-optimize requests, web-monitoring-db#1070)
  - ❌ Possibly do all the above as part of a v1 (instead of v0) API. (Update: I wound up gating most features on being logged in rather than removing them, so I don’t think there’s a major reason to mess with versioning.)
  - Put the API behind CloudFront or API Gateway. (Put API behind CloudFront or API Gateway, web-monitoring-ops#42)
  - Do not require auth for requests that read non-user-related tables. (Make API publicly accessible for reading non-user data, web-monitoring-db#1069)
  - Do not require auth when browsing the UI. (Make UI publicly accessible for viewing, web-monitoring-ui#1025)
  - Do not require credentials in the Python DB client. (Don’t require credentials in DB client, web-monitoring-processing#844)
- Import reliability issues!
  - Impose hard time limits on the import job in case it hangs (it does every so often, and seems to be some sort of issue in either Python requests or urllib3, but also 🤷). It’s possible a switch to httpx in wayback could help alleviate this, but nothing except crashing the job after N hours is a true panacea. (Done in edgi-govdata-archiving/web-monitoring-ops@6c1ba74; a watchdog sketch appears after this list.)
  - Run the importer on some kind of job scheduling system (Kube CronJob, GH Action via https://cirun.io, etc.). There’s this issue: Rewrite Dockerfile to run imports and healthcheck and run it on ECS (web-monitoring-processing#757), and Forest G had some nice thoughts about approaches for this kind of stuff in https://tinyletter.com/slow-news/letters/the-mgdo-stack (in particular, cirun.io looks lovely).
  - Make sure we are using bigint (not int) for IDs here. (Should actually just do this for all tables with non-UUID primary keys: Use bigint for all non-uuid primary keys, web-monitoring-db#1067.)
  - ❌ (Maybe, not critical) Import response bodies directly to S3 as part of the importer instead of as part of the API. (Import script should upload bodies directly to S3, web-monitoring-processing#663) (2023-02-10: Not going to bother here; things are currently working fine enough.)
- ❌ (Maybe) Consider moving to either EKS or ECS + Terraform Cloud (free tier). Kops is super helpful for managing our own Kube deployment, but it’s still extra, complicated work that would be nice not to impose on anyone who has to touch or maintain this system later. This isn’t really critical, and may not pass the effort vs. return test when I get around to looking at it closely. (2023-02-07: This is too big a change to be reasonable when putting things to rest. It would be good and make things easier to maintain! But it’s too likely to cause other unknown issues and require time to settle out all the little things.)
- Turn off auto-annotation/analysis of imported versions in web-monitoring-db. (Disable auto-analysis of new versions, web-monitoring-db#1090)
- Turn off metrics server. We aren’t actually using it, and it costs money to run (both the EC2 instance and the load balancer). (Turn off metrics server, web-monitoring-ops#43)
- Turn off Dependabot updates (keep Dependabot security updates, though).
- Shut down staging deployment. (Shut down staging deployment, web-monitoring-ops#45)
- Documentation!
  - Lessons learned from this project. (2023-07-17: Not doing this; not critical to shutdown.)
- ❌ Maybe consider pulling out the Ruby SURT code into a gem. Not sure this is really worthwhile, though — I asked on the Internet Archive Slack last year and got back only responses that were essentially “that would be nice, but I don’t have real valuable use cases.” (Extract SURT into a separate gem, web-monitoring-db#767) (2023-07-17: Not worrying about this as part of shutdown. It can always be revisited in the future; the source code is not going away.)
- ❌ Clean up dangling issues and PRs.
- ❌ Consider archiving some of the data here as WARC files, or outputting some archival data as SQLite or similar. I probably won’t actually do this (it’s not clear there’s much to gain from this work), but it’s at least worth investigating once everything else is done. (2023-03-09: It doesn’t appear there’s much utility in doing this. The service continues to run and the database receives backups; the data is accessible for others to do this as a future project if desired.)
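To illustrate the range-based pagination item above: a minimal sketch of keyset pagination in Python, assuming a `versions` table sorted by `capture_time` and `uuid`. The table and column names are illustrative only, not necessarily the actual web-monitoring-db schema.

```python
# Keyset ("range-based") pagination sketch. Instead of OFFSET, which makes the
# database scan and discard all earlier rows, we filter on the sort key of the
# last row from the previous page, so every page is an index seek.
# Row-value comparisons like (a, b) > (?, ?) work in PostgreSQL and SQLite 3.15+.
import sqlite3

PAGE_SIZE = 100

def fetch_page(conn, after=None):
    """Fetch one page of versions ordered by (capture_time, uuid).

    `after` is the (capture_time, uuid) tuple of the last row on the previous
    page, or None for the first page.
    """
    if after is None:
        sql = ("SELECT capture_time, uuid FROM versions "
               "ORDER BY capture_time, uuid LIMIT ?")
        params = (PAGE_SIZE,)
    else:
        sql = ("SELECT capture_time, uuid FROM versions "
               "WHERE (capture_time, uuid) > (?, ?) "
               "ORDER BY capture_time, uuid LIMIT ?")
        params = (*after, PAGE_SIZE)
    return conn.execute(sql, params).fetchall()

def iterate_all(conn):
    """Walk the whole table one page at a time."""
    after = None
    while True:
        rows = fetch_page(conn, after)
        if not rows:
            break
        yield from rows
        after = rows[-1]  # sort key of the last row becomes the next cursor

# Usage (hypothetical database file):
# conn = sqlite3.connect("web-monitoring.db")
# for capture_time, uuid in iterate_all(conn):
#     ...
```

This is also why the `sort` parameter has to be limited: keyset pagination only works when the requested sort order matches an index the database can seek into.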
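To illustrate the hard-time-limit item under “Import reliability issues”: a minimal watchdog sketch that crashes a hung import after a fixed deadline. The command, module path, and six-hour limit below are placeholders, not the actual configuration; the real change lives in edgi-govdata-archiving/web-monitoring-ops@6c1ba74 and may take a different form (for example, `activeDeadlineSeconds` on a Kubernetes CronJob).

```python
# Watchdog sketch: run the import as a subprocess and kill it if it exceeds a
# hard time limit. The import command here is hypothetical.
import subprocess
import sys

IMPORT_COMMAND = ["python", "-m", "web_monitoring.cli", "import"]  # placeholder
TIME_LIMIT_SECONDS = 6 * 60 * 60  # placeholder limit

def main() -> int:
    try:
        result = subprocess.run(IMPORT_COMMAND, timeout=TIME_LIMIT_SECONDS)
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child when the timeout expires; exit
        # nonzero so the scheduler (cron, Kubernetes CronJob, etc.) records a
        # failure and can alert or retry.
        print(f"Import exceeded {TIME_LIMIT_SECONDS}s and was killed.", file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```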