Upload FOSS mods to the Internet Archive, allow clients to fallback #1682
I'm in favor of option 3 for the following reasons:
As far as the downsides:
A downside you didn't mention:
A couple of hours ago @techman83 made changes to the indexer such that on merges/commits to the NetKAN repo, inflation of the changes is done automatically and immediately. This all but eliminates the need to touch CKAN-meta (aside from modifying earlier versions of mods). Further, from a personal workflow perspective I would find it far more efficient if all mods/identifiers were available in the NetKAN directory.
@dbent Excellent points and I'm pleased you've chimed in!
This only gets current releases, but that's really cool and not something I'd thought of. I could have the mirrorbot look for the hash and upload the file. Noting a hash change would then trigger a re-upload, which saves trying to figure out whether a mod needs re-uploading. If the CKAN client will ignore the extra metadata for now, I'd be all for adding a 'download_hash' field or something of that ilk and worrying about how the client implements it later. As a minor implementation detail, if you could have NetKAN cache the file with the hash filename it would save us some double handling and also solve the infrequent need for me to log in and remove faulty cached downloads.
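To illustrate the idea, here is a minimal Python sketch, assuming a local cache directory and a hypothetical 'download_hash' field; the field name, its shape, and the hash algorithm are placeholders rather than anything decided here:

```python
import hashlib
import json
import shutil
from pathlib import Path

CACHE_DIR = Path("cache")  # assumed local NetKAN download cache

def hash_and_cache(downloaded_file: Path, ckan: dict) -> dict:
    """Hash a downloaded zip, store it in the cache under the hash filename,
    and record the hash in a hypothetical 'download_hash' field."""
    sha1 = hashlib.sha1(downloaded_file.read_bytes()).hexdigest()

    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / f"{sha1}.zip"
    if not cached.exists():
        # Cache keyed by hash, so a changed upstream zip lands as a new file
        shutil.copy(downloaded_file, cached)

    ckan["download_hash"] = {"sha1": sha1}  # assumed field shape
    return ckan

if __name__ == "__main__":
    metadata = {"identifier": "AwesomeMod", "version": "1.0"}
    print(json.dumps(hash_and_cache(Path("AwesomeMod-1.0.zip"), metadata), indent=2))
```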
Fair point; I was thinking of the case where we end up with a dodgy zip replacing a good one. But the old one will still be on the Archive with a different hash, and we could add 'x-netkan-hash-override' for exceptions if we go to having NetKAN do everything. (This also solidifies my item-per-version line of thinking.)
It would require an indexing strategy change, as inflating ~6000 mods hourly won't scale. However we could create a CKAN-meta-manual (or put a CKAN-meta folder into the NetKAN repo) and put our current set of CKANs into it. That would take care of our initial run of populating the hash of all the files. We can then just use webhooks to inflate new stuff pushed into it on demand. We might want to cancel that Jenkins job though 😆 An implementation strategy could be:
Phase 1
Phase 2
Phase 3 (Though technically this can be done anytime)
Phase 4
Can I just say, I <3 your idea @dbent
Yeah, back-filling the data is an exercise left to the reader. (Although at some point I do want to implement NetKAN support for dumping metadata for all releases, not just the latest).
Yeah, that can be done.
Absolutely, something I've wanted to do anyway.
Is ~6000 the number of total CKANs?

Initial:

{
  "identifier": "AwesomeMod",
  "version": "1.0",
  "download": "http://awesomemod.example/download/1.0"
}

Which would spit out the corresponding 1.0 CKAN. After a manual update:

{
  "identifier": "AwesomeMod",
  "version": "2.0",
  "download": "http://awesomemod.example/download/2.0"
}

Which would spit out the corresponding 2.0 CKAN.

From a rough look there seem to be about ~1400 individual mods in CKAN-meta, versus ~1200 individual mods in NetKAN, so a ~17% increase in NetKAN indexing. If we can create an endpoint for GitHub webhooks and then have authors authorize the NetKAN application to send webhook events, that plus the existing SpaceDock hooks would probably let us scale down our full NetKAN indexing to daily or semi-daily runs. Phases seem good.
I have ideas for that. So all good!
Ohhhh I see! That makes more sense and would be OK even with the batch indexer as it sits right now. We could configure Jenkins to fail all builds and direct people to NetKAN; this can happen anytime. There are 5620 total CKANs and 1188 orphaned by KerbalStuff. The ~6000 was my rough estimate of how many FOSS CKANs there are; I was wildly off by the looks of it 😆
The only issues I see are that it would involve authors configuring webhooks, us somehow mapping them, and that we wouldn't be able to authenticate the hooks. With 1400 mods we could crawl 23 per minute with a lite scan and have the whole lot checked hourly without belting the endpoints.
And we have a collection to house them all in now!
As an aside, if we did end up with too many unique .netkans after combining traditional netkans and the ckans-come-netkans, we could use two extension types (or a field within the file, or whatever) to differentiate between metadata that should be inflated/checked routinely and metadata that only needs to be checked when there's a change to the metadata itself.
@plague006 we could easily stick them in a separate directory, then let the webhooks sort it out.
It would mostly involve us creating an "application" and providing a link for authors to authorize us, and then we'd set up the webhook ourselves. Manual configuration shouldn't be necessary: just click a link, then click an authorize button.
This could be done by looking at
If we set a secret in the webhook when we create it, GitHub will use HMAC to sign any payloads, which we could then verify.
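For reference, verifying such a signature only takes a few lines: GitHub puts an HMAC-SHA1 of the raw request body (keyed with the webhook secret) in the X-Hub-Signature header. A minimal Python sketch:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub webhook payload against its X-Hub-Signature header."""
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # compare_digest avoids leaking timing information during the comparison
    return hmac.compare_digest(expected, signature_header)

# Example usage with made-up values:
ok = verify_signature(b"our-webhook-secret",
                      b'{"action": "published"}',
                      "sha1=0123456789abcdef0123456789abcdef01234567")
print("payload authenticated" if ok else "signature mismatch")
```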
Good point, that wouldn't be super difficult.
We already scan all the metadata on every test run. The JSON::XS library is pretty darn fast; scanning through all the netkans takes very little time, and we could even just store it in memory if we wanted to (1400 items is nothing really).
Takes no time at all on my old laptop.
Yeah, we're already verifying the webhooks we currently receive using HMAC. I also had a thought about the hash filename: how are we going to get that hash without re-downloading the file every time we inflate? We have 8.6 GB of recently cached files; redownloading that regularly might be a little unfriendly to our AWS credits and the hosts we're scraping. If we move entirely to webhooks/API/HEAD checks and only inflate exceptions, this would reduce the impact significantly.
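A lite HEAD check along those lines could look roughly like the sketch below; it assumes the download host returns usable ETag/Last-Modified/Content-Length headers, which won't be true everywhere:

```python
from urllib.request import Request, urlopen

def remote_fingerprint(url: str) -> tuple:
    """HEAD the download and collect headers that hint whether it changed,
    without pulling the whole file down again."""
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return (resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
                resp.headers.get("Content-Length"))

def needs_reinflation(url: str, last_seen: tuple) -> bool:
    """Only re-download and inflate when the remote file looks different."""
    return remote_fingerprint(url) != last_seen
```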
I like the idea of only making netkans, so CKAN-meta is only populated automatically. I already use that technique when creating CKANs locally.
I threw something together to generate hashes; the only thing left to cover off is file extensions. Though looking at the API, we can just loop through the files looking for a matching sha1 (testing shows it matches what GetFileHash generates, except lowercase).
@dbent thoughts? We could write a FileExtTransformer or just hit up the API if we want to look for a download URL.
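As a rough sketch of that lookup (the item name 'AwesomeMod-1.0' and local filename are made up; the archive.org metadata endpoint returns a 'files' list whose entries carry lowercase sha1 values):

```python
import hashlib
import json
from urllib.request import urlopen

def sha1_of(path: str) -> str:
    """Lowercase SHA-1 hex digest of a local file, matching what IA stores."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def find_download_url(item: str, local_sha1: str):
    """Loop through the item's files looking for a matching sha1."""
    with urlopen(f"https://archive.org/metadata/{item}") as resp:
        meta = json.load(resp)
    for entry in meta.get("files", []):
        if entry.get("sha1") == local_sha1:
            return f"https://archive.org/download/{item}/{entry['name']}"
    return None  # no mirrored file with that hash

# Hypothetical item identifier and local zip:
print(find_download_url("AwesomeMod-1.0", sha1_of("AwesomeMod-1.0.zip")))
```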
@techman83 So if I'm understanding correctly... a download URL from IA looks like:
Which CKAN would determine by using the following values from the metadata:
Is the plan to have the filename just be the hash, identifier+hash, identifier+version+hash? So the only other bit we need is what the file extension will be? Am I understanding that correctly? In that case I'm open to two ideas:
{
"download_content_type": "application/zip"
}
Below is how it currently is, though I'm not tied to it. Since it's the Internet Archive, having somewhat human-readable names seemed logical.
Yes, you got it. Either option works; it's a single API call to get all the metadata about the item.
Option 2 is worth considering, because we might end up with multiple mirrors one day and the other mirrors might not have a nice API for us to get the required information. Media type is sensible; it's easy enough to derive extensions from that, and the NetKAN code already does it.
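A trivial sketch of that derivation; the mapping below is an assumption, and the real table would follow whatever content types the client actually accepts:

```python
# Assumed mapping from a mod's download_content_type to a file extension.
CONTENT_TYPE_EXTENSIONS = {
    "application/zip": "zip",
    "application/x-gzip": "gz",
    "application/x-tar": "tar",
}

def extension_for(content_type: str, default: str = "zip") -> str:
    """Pick a file extension from the recorded media type."""
    return CONTENT_TYPE_EXTENSIONS.get(content_type, default)

print(extension_for("application/zip"))  # -> "zip"
```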
@dbent @pjf @KSP-CKAN/wranglers The majority of 'Phase 1' is complete. NetKAN produces the download hashes for the mirror library to consume. The @KSP-CKAN/wranglers have done an awesome job standardizing our metadata in KSP-CKAN/NetKAN#3890, and as of KSP-CKAN/NetKAN-bot#38 the bots are uploading license-compliant CKANs to the Internet Archive. The current workflow looks like:

@dbent I've had more time to think; it's likely we'll build a resolver for whichever mirror backends we have and cycle through them. Using the API allows us to not be dependent on file names; the SHA1 is always going to resolve to the correct file.
@dbent @pjf @KSP-CKAN/wranglers Phase 1 is completed and Phase 4 is in progress. I set up a separate bot account called 'kspckan-crawler' so we can easily tell which commits the crawler has performed. You will notice a lot of changes go through over the next few weeks. It's currently checking 2 mods every 5 minutes, so at the current rate it will take somewhere around 17 days to go over the entire collection. Probably sooner, as there were a large number of mods with download hashes but not yet mirrored, and those are being checked separately at the same rate (2 mods every 5 minutes).
@techman83 Excellent, and I see you fixed the bot name. 😄 We really ought to give the bots the same avatar as the KSP-CKAN org: https://avatars1.githubusercontent.com/u/9023970?v=3&s=200
@dbent @KSP-CKAN/wranglers ckan-notices should be a lot quieter now. The crawler has completed its task.
We should probably also add the ability for developers to provide several download links instead of just one. This would work in combination with this archiving technique.
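For what it's worth, a hedged sketch of how a client might walk such a list, falling back to the archive mirror when the primary link fails (the URLs and helper name are placeholders, not part of the spec):

```python
from urllib.error import URLError
from urllib.request import urlopen

def fetch_first_working(urls):
    """Try each download URL in order and return the first successful body."""
    last_error = None
    for url in urls:
        try:
            with urlopen(url, timeout=30) as resp:
                return resp.read()
        except URLError as err:
            last_error = err  # remember why this mirror failed, try the next one
    raise RuntimeError(f"all download mirrors failed: {last_error}")

# Placeholder URLs: author's site first, then the Internet Archive mirror.
data = fetch_first_working([
    "http://awesomemod.example/download/1.0",
    "https://archive.org/download/AwesomeMod-1.0/AwesomeMod-1.0.zip",
])
```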
An interesting case has come up. We've had the wrong licence for a bunch of Snark's mods since around October last year. KSP-CKAN/NetKAN#5512 is the start of fixing up the licence info in CKAN, but we should update the licence info on archive.org. Is that possible? Is there a way to automate that?
@politas It is definitely possible, and automating it would be reasonably straightforward. Knowing when to trigger that update is the trickier part.
Proposal
Similar to #935, we can upload FOSS mods to the Internet Archive. There is example code over at KSP-CKAN/NetKAN-bot for uploading to the IA; the sticking point is the resulting uploaded file.
An item per mod version made sense to me, as the metadata per version could change, but I'm open to ideas.
The big rub is the filename. We have options, and they have consequences.
I came up with 2 whilst I was writing this; it's the option I'm currently leaning towards. Pinging @dbent @pjf, who are more across the internals of the client code base, and @Dazpoet @plague006 @politas, who are across the metadata.
Implementation
Option 3 was decided on.
Phase 1
Phase 2
Scan NetKAN_repo for any changes we missed: git diff $(git rev-list -n1 --before="yesterday" master) --name-only -p NetKAN_repo/CKAN-meta/ (covered by full inflation, as all former ckans are treated the same way as netkans).
Phase 3 (Though technically this can be done anytime)
Phase 4