CKAN's GitHub downloads are breaking the rules #2210
Well, we're already on track for a point release with a spec update, so it's a good time to do this. We've had projects before to go through old metadata and update it, so adding GitHub download info isn't impossible (though it may be impractical). Even if we just force the current releases to be repopulated, and all new ones going forward, we'll be ahead of the game, and if our trigger for the different download behaviour is the existence of the github_download URL, the worst that will happen is that things work as they do now. I like the multi-pass approach of starting with no token and failing over to a common token. The number of users that would still have issues is probably pretty small (there aren't that many that have issues currently), and adding the ability for users to add their own API token would probably work for that minority of users wanting three-figure mod downloads. Could you post something on the CKAN thread explaining what you've found and what users can do to avoid the download throttling in the meantime? (i.e., download contents in blocks of 60 mods and wait in between)
I would love to give users more info, however right now there is no useful info to give. The 60/hour limit assumes use of the API, which is not how the currently released CKAN client works. It instead uses the browser download URL, which has secret and unknown throttling characteristics. This was what I was hoping to learn from the GitHub contact person in the first place, and he only confirmed that they're not going to explain it publicly. There's also the bit at the end about parallel downloads; it's quite possible that the secret throttling rules would trigger for (far) fewer than 60 simultaneous downloads, and in fact I think I've seen it do that. So unfortunately I'm not able to outline an effective workaround for end users, other than downloading mods one at a time.
We'd want to be exceptionally careful about baking tokens into the client: first, they could be extracted and abused by a third party, and second, rotating tokens to work around the limits may be against the terms of service. Probably in the interim we can start generating URLs that point to the API download and use some logic to use it instead, but I'm guessing the unauthenticated limit is quite low? We also have the client implementation part of #1682, which would allow FOSS mods to fall back to the Internet Archive for the download. That is one of the driving factors in building that system.
Ok, if I'm synthesizing the data above correctly, we should be doing this:
And as a separate exercise, implement a way for users to enter their own GitHub token for the GitHub downloads and throw a message advising of the option when we get a GitHub 403 error.
Is the only reason to have a
Actually, since we store the content type in the metadata as
Actually, throw in some quality values to assert that we would prefer the specified content type over some arbitrary binary stream and we should have something that works for anything: |
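For example, assuming the content type stored in the metadata were `application/zip` (illustrative values):

```
Accept: application/zip; q=1.0, application/octet-stream; q=0.9
```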
Essentially it's because older clients will fail to download the API URLs, and the metadata is shared between old and new clients. If we switch all the URLs over in CKAN-meta, then everyone who hasn't upgraded to the newest CKAN will get download errors for all GitHub URLs, every time.
The way that I would roll it out would be as follows:
One thing I've been meaning to do is add a |
Sounds like a good plan to me. We can probably add the Accept header in 1.24...
Wait, if the spec version requirement prevents them from downloading, then what do steps 3-5 buy us as compared to just switching the URLs immediately? GitHub downloads break either way, but changing the URLs would actually fix something as well.
I suppose we don't need step 5 and can do 3-7 pretty much simultaneously.
OK, sorry, I forgot some pieces of the puzzle since writing this up. The
Couldn't that be specially handled in the client by checking for the presence of a
Yeah, that makes sense. In fact, we can have a
Where would we store the tokens?
Right now it would be the registry, which is where user configuration data is stored. I'm not really a fan of that, however, and have been meaning to replace it with a file-based solution using JSON files.
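For instance, a hypothetical tokens file (file name and key are made up for illustration):

```json
{
  "GitHubAuthToken": "<your token here>"
}
```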
Works as expected.
Yeah, I realized that after I posted and deleted that comment. It does indeed work even in a patched CKAN.
Not sure how feasible this is, but maybe offer the option for users to use their own API keys?
#2263 will allow that.
As of today, most CKAN users will be on 1.24, which supports API URLs. We can consider migrating the metadata as we please now. Asterisk: Mono users might still be able to run 1.22.6 and earlier, since it worked for me on Ubuntu even after #2293's GitHub Apocalypse. Maybe allow a few weeks for this group to upgrade.
Summary of current status:
Best of all, everyone now has these improvements, because #2293 broke all older clients. Remaining to be done:
However, complaints about the throttling have largely gone quiet lately. It may be wise to leave this as-is for a while lest we spoil a calm situation.
GitHub downloading needs a rewrite
(I debated whether to add this as a comment to #1817, but it seems like too much text and detail for that.)
Problems
Currently if CKAN downloads many files from GitHub at the same time, they often fail with HTTP status code 403-Forbidden. #1817 contains an example, but these reports are common and I've definitely seen it happen myself several times.
Background
The GitHub API uses 403 codes for throttling; you get 60 unauthenticated requests per hour, and any beyond that return a 403. I encountered this while working on an unrelated project, and I had to use a GitHub token to allow 5000/hour, passed in the HTTP request headers:
Authorization: token <OAuth token here>
Currently CKAN's downloads do not go through the GitHub API, so this does not necessarily indicate exactly what's going on with them. However, it establishes that 403-Forbidden is sometimes used for throttling, and it becomes more relevant later in discussion of the API.
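As a sketch (CKAN itself is written in C#, so this Python is purely illustrative), passing the token with each API request looks like:

```python
import urllib.request

def github_request(url, token=None):
    # Build a GitHub API request; an OAuth token raises the
    # unauthenticated limit of 60 requests/hour to 5000/hour.
    req = urllib.request.Request(url)
    if token:
        # Header format documented by GitHub for OAuth tokens:
        req.add_header("Authorization", "token %s" % token)
    return req
```

Without a token, the request simply goes out unauthenticated and counts against the 60/hour limit.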
Sample API data for releases, minus the author and uploader fields since they're long and not relevant to this issue:
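An abridged illustration of the shape of that data (repository and values are made up; the field names follow GitHub's releases API):

```json
{
  "url": "https://api.github.com/repos/ExampleUser/ExampleMod/releases/1",
  "tag_name": "v1.0",
  "assets": [
    {
      "url": "https://api.github.com/repos/ExampleUser/ExampleMod/releases/assets/1",
      "name": "ExampleMod.zip",
      "content_type": "application/zip",
      "browser_download_url": "https://github.com/ExampleUser/ExampleMod/releases/download/v1.0/ExampleMod.zip"
    }
  ]
}
```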
The zip file that we want to download is associated with `assets[0]`, and there are two fields for it, `url` and `browser_download_url`. This becomes important later.
Investigation summary
I used the "Contact GitHub" link to reach out to GitHub about how their download throttling works. Surprisingly, the person who replied understood exactly what I was talking about and how to fix it 👍. It turns out that these problems happen because CKAN is not using GitHub as intended. From my conversation with the very helpful support person:
("good citizen" was my phrasing in my original message, so don't take that as an unprovoked criticism of our civic virtues.)
Key points:
- The URL from the field we're using currently (`browser_download_url`) is for users and browsers only, not applications. It can be throttled, but there is no explicit policy or workaround.
- We should be using the GitHub API for downloads. Currently we use it in the Netkan code that finds new releases, but for downloads we effectively impersonate a browser.
- This can be done by requesting the `url` field instead of `browser_download_url` and setting a custom HTTP header: `Accept: application/octet-stream`
I tested this with wget, and setting the Accept header did indeed give me the download. Without this header, it returns a JSON object describing the asset.
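The same check, sketched in Python rather than wget (illustrative, not CKAN code; the asset URL is hypothetical):

```python
import urllib.request

def asset_download_request(asset_url):
    # Request the asset's bytes rather than its JSON description:
    # without this Accept header, the API returns a JSON object
    # describing the asset instead of the file itself.
    req = urllib.request.Request(asset_url)
    req.add_header("Accept", "application/octet-stream")
    return req

# Hypothetical asset URL (the "url" field of assets[0]):
req = asset_download_request(
    "https://api.github.com/repos/ExampleUser/ExampleMod/releases/assets/1")
```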
Changes needed to stop abusing `browser_download_url`
GitHub-specific downloading metadata & logic
When downloading from GitHub, we need to send the custom HTTP header. This cannot be accomplished simply by swapping out the bad URL for the good URL in the `download` metadata field.
Proposed new metadata field: `github_download` - the `assets[0].url` value from the API
Specific changes:
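With both fields in place, a module's metadata in CKAN-meta would carry the old and new URLs side by side during the transition (illustrative values):

```json
{
  "download":        "https://github.com/ExampleUser/ExampleMod/releases/download/v1.0/ExampleMod.zip",
  "github_download": "https://api.github.com/repos/ExampleUser/ExampleMod/releases/assets/1"
}
```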
UI to handle 403 statuses
If a GitHub download returns a 403 status, we should handle the exception and notify the user that their downloads are being throttled. We could direct them to the setting (see below) and a web page dealing with GitHub auth tokens, and/or advise them to wait 60 minutes for their limit to reset. https://api.github.com/rate_limit can be used to get the exact limit and timing numbers.
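A sketch of how the client might turn a rate_limit response into a user-facing message (the JSON shape follows GitHub's documented response; the function name is hypothetical):

```python
import json

# Abridged shape of the https://api.github.com/rate_limit response,
# hard-coded here so the example runs offline:
sample = json.loads(
    '{"resources": {"core": {"limit": 60, "remaining": 0, "reset": 1372700873}}}')

def throttle_message(rate_limit_response, now):
    # Turn the limit/remaining/reset numbers into an explanation
    # of the 403 the user just hit.
    core = rate_limit_response["resources"]["core"]
    wait_min = max(0, int(core["reset"] - now)) // 60
    return ("GitHub is throttling downloads: %d of %d requests left; "
            "limit resets in about %d minutes."
            % (core["remaining"], core["limit"], wait_min))
```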
GitHub token handling
Users will be limited to 60 GitHub downloads per hour, because this is the limit of the GitHub API. 140+ mod installs are pretty commonly mentioned on the forums, and reinstalling everything from scratch is a common method for dealing with compatible upgrades, so some users would probably encounter this limit and not appreciate the 60-minute wait to be able to download more. The only way around this is to use a GitHub auth token, which boosts the limit to 5000/hour per token.
It would be nice to ship a single internal auth token for all of CKAN, since then users would have the 5000/hour limit by default without having to worry about any of the details. More responses from the GitHub contact person:
Deliberate abuse like that is unlikely, but assuming 200 downloads per active user per hour, a 5000/hour limit across all CKAN users would support 25 active users in a given hour. The number of active users at a given time isn't known, but the latest CKAN release has over 60000 downloads, so it's probably more than 25. If we were able to determine the limit we needed per hour, we could divide it by 5000 and then generate that many tokens and pick one randomly per request, but that might not be in the spirit of the API's rules.
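The arithmetic above, spelled out (the 200 downloads per user is the assumption stated in the text):

```python
# Numbers from the discussion: 5000 requests/hour per token, and an
# assumed 200 downloads per active user per hour.
PER_TOKEN_LIMIT = 5000
DOWNLOADS_PER_USER = 200

# Active users a single shared token could support per hour:
users_per_token = PER_TOKEN_LIMIT // DOWNLOADS_PER_USER  # 25

def tokens_needed(active_users):
    # Hypothetical pool sizing: tokens required to cover a given
    # number of active users per hour (rounded up).
    return -(-active_users * DOWNLOADS_PER_USER // PER_TOKEN_LIMIT)
```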
A setting
We could create a new settings field called GitHub Auth Token, where the user could fill in their own tokens to allow more downloads. This could be instead of or in addition to any built-in tokens we may or may not use, and it should support all the UIs.
Multi-pass approach
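As described earlier in the thread, the idea is to start with no token and fail over to a common token, and finally the user's own. A minimal sketch, with hypothetical names and `fetch` standing in for whatever HTTP layer the client uses:

```python
class Throttled(Exception):
    # Raised by the HTTP layer on a 403 rate-limit response.
    pass

def download_with_fallback(fetch, url, user_token=None, shared_token=None):
    # Multi-pass: try unauthenticated first, then the shared CKAN
    # token, then the user's own token; return the first success.
    attempts = [None] + [t for t in (shared_token, user_token) if t]
    last_error = None
    for token in attempts:
        try:
            return fetch(url, token)
        except Throttled as exc:
            last_error = exc
    raise last_error
```

If every pass is throttled, the final 403 propagates so the UI can show the advisory message discussed above.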
Migration concerns
If Netkan was updated to use this new scheme tomorrow, current CKAN clients would break unless the old `download` field was still populated. So we should not remove support for the old metadata immediately; GitHub downloads should use both `download` and `github_download` until all clients are updated.
Or just download serially
The API docs say:
So even with a token, CKAN's parallel download method would still be in violation of the letter of the law.
As a halfway measure, we could try scaling back the parallelization of downloads.
This might solve the problem without messing with all the API/token stuff. We would still technically be misusing GitHub, but users should no longer encounter failed downloads as frequently.
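A sketch of what capping the concurrency could look like (names and the limit of 4 are hypothetical, not CKAN's actual downloader):

```python
import threading

def download_all(urls, fetch, max_parallel=4):
    # Run downloads in parallel but cap concurrency with a
    # semaphore, as a halfway measure against tripping GitHub's
    # abuse detection. fetch(url) stands in for the HTTP layer.
    gate = threading.Semaphore(max_parallel)
    results = [None] * len(urls)

    def worker(i, url):
        with gate:
            results[i] = fetch(url)

    threads = [threading.Thread(target=worker, args=(i, u))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```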