"Waiting for img.shields.io" #445

Closed
weitjong opened this issue May 13, 2015 · 73 comments
Labels
operations Hosting, monitoring, and reliability for the production badge servers

Comments

@weitjong

After changing our project website to use shields.io, the badges frequently take a long time to render. Whenever we have a problem with badge rendering on our website, http://shields.io/ has a similar problem, so I guess the issue is not on our side.

Is there any way to improve this? I am considering switching back to the original badges from their respective service providers. It is better to have a reliable low-resolution version than an unreliable high-resolution badge.

@espadrine
Member

I see you are using Coverity and Travis badges. I'll try to monitor their response times. Could you let me know when they are not performing as you would expect?

@weitjong
Author

Thanks for the prompt reply, much appreciated. I am sorry that I did not note down the exact time it happened. What I can confirm is that when it did happen, the sample badges on http://shields.io/ were also not rendered properly. I should add that the outages did not take long to recover; it is their frequency that worries me. I have caught the service with its pants down three or four times already since we switched to shields.io.

@weitjong
Author

It is happening now.

@weitjong
Author

It seems to have recovered again.

@espadrine
Member

Thanks a lot for reporting. There seems to have been a series of sudden surges in request frequency past 400 per second (from a normal 40), coming from servers at Amazon AWS. This caused a lot of request time-outs and made Redis fail, which in turn caused a small number of recovered crashes (about 50).

There was also a Heroku slow-down at the same time, caused by our using slightly over 512 MB of memory, which made things worse.

I plan on switching away from Heroku. With a better infrastructure we can hopefully use a lot more memory and absorb surges like these.

[heroku metrics screenshot]

@weitjong
Author

Thanks for looking into this.

@untitaker

I frequently experience this with the Gratipay shields.

@ionelmc

ionelmc commented Jun 18, 2015

This issue still happens

@stephnr

stephnr commented Jun 18, 2015

It is currently happening with badge rendering.

@weitjong
Author

This may sound stupid (or arrogant, depending on how you read it), but I was wondering whether this issue existed before I raised it. More precisely, did it exist before our project website switched to shields.io badges? Or is it just that I am the first to bother reporting it?

I will not reveal the link to our project website here, to avoid the impression that I am promoting it, but I believe it contributes a fair amount of load on your server because:

  • Our website has picked up significant traffic recently.
  • The badges are located in our default footer page template, i.e. the badges appear in the footer of all the generated pages, including our documentation pages.
  • Before the switch, we employed a small piece of JavaScript to force the browser to always download the Travis CI build status badge remotely instead of from its local cache (a sketch of the idea follows at the end of this comment). This JavaScript is still in place after the switch.

Combined, this means that no matter where our website visitors navigate, each rendered page generates a small workload on your server. The accumulated load could become significant, depending on how well your server scales up on demand. Hence I am wondering whether it could be our own doing that caused this in the first place! I hope I am worrying too much, though.
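
For illustration only, a minimal sketch of the kind of cache-busting snippet described above; the CSS class, selector, and query parameter name are assumptions, not the actual script:

```javascript
// Illustrative sketch: re-request badge images on every page load by
// appending a timestamp query parameter, so the browser bypasses its cache.
// The CSS class "build-badge" is an assumed marker for the badge <img> tags.
document.addEventListener('DOMContentLoaded', function () {
  var badges = document.querySelectorAll('img.build-badge');
  for (var i = 0; i < badges.length; i++) {
    var src = badges[i].getAttribute('src');
    var sep = src.indexOf('?') === -1 ? '?' : '&';
    badges[i].setAttribute('src', src + sep + 't=' + Date.now());
  }
});
```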

@espadrine
Member

@weitjong Based on my server logs, there is no single website generating a majority of the traffic, although GitHub is the most significant. Issues related to the responsiveness of the server have existed as long as we have used the current configuration (see e.g. #226), and have been solved by caches and algorithms so far. The main issue is that, as far as I know, Heroku doesn't give the server software a way to detect when it enters slowdown mode. I will change the server setup, however.

@weitjong
Author

Thanks for your prompt reply. Again much appreciated.

That is quite a relief to hear. I just want to come clean. 😄
Our website is hosted on github.io too, but I guess that does not change anything, since as you said the issue existed long before.

@ms-ati

ms-ati commented Jun 24, 2015

Just to add: the badges are still slow.

@mjackson

I'm seeing timeouts on my Travis and npm badges today at https://github.com/rackt/history.

@weitjong
Author

The problem seems to be getting worse. I have temporarily switched our website back to the lower-res badges from their original sources instead of shields.io. It probably does not make any difference to shields.io's performance, but who knows.

@espadrine
Member

Indeed, it won't make a difference. Unfortunately, the server became unreachable at the worst possible time — just as I started going to sleep. I'm investigating the issue with my hosting provider.

@espadrine
Member

I have just rebooted the server, and things are running again.

@ms-ati

ms-ati commented Sep 15, 2015

@espadrine have you considered hosting the images on S3 behind CloudFront? Couldn't you cache them there, if not actually render them there as a static site as the primary representation?

@espadrine
Member

@ms-ati There is a difference between having slow badges, having downtime, and having incorrect badges.

I strive to produce badges that are as correct as possible, as fast as possible, and with as little downtime as possible.

I switched hosting providers a week ago to fix the speed issue; badges should now be exactly as slow as the service that produces the information they provide (plus network lag, which is generally negligible). More importantly, while before the server entered a severe slowdown for one hour every week at peak times, that should no longer happen. So far, that seems accurate.

Today, the VPS went down and did not restart; I am talking to the provider about it. Having a CloudFront cache would not help when the machine that serves the images is not up at all.

Cache really is not the issue right now. I have a cache that I use when vendors (Travis CI, etc) do not respond, when data changes rarely, or when the server receives duplicate requests in rapid fire.

So: are the badges still slow for you?

@gnzlbg

gnzlbg commented Sep 18, 2015

Is it down again? :)

@ms-ati

ms-ati commented Sep 18, 2015

@espadrine Shouldn't a CloudFront or Fastly cache, simply in front of the badge system as a whole, smooth over any temporary outages? In other words, isn't HTTP caching itself a very well-suited mechanism for increasing the availability of the badge URLs?

@espadrine
Member

Yes, the server went dark again. I'm getting annoyed at OVH. I'd rather not spend another week of holidays setting things up yet again on a new server (say, DigitalOcean), but two downtimes a week is obviously unacceptable. I sent them an email; I will see what they say.

@ms-ati I use HTTP caching. Obviously, most people want badges to have accurate information, though. It is irrelevant anyway when you can't even access the IP that the DNS points your browser to.

@tankerkiller125

@espadrine Would it make more sense to use AWS with auto-scaling, load-balanced instances? That would handle large amounts of traffic quickly and easily according to your auto-scaling policy, so that when, say, 80% of the CPU is in use, AWS automatically spins up an exact copy of the instance and starts sending traffic to it alongside the original.

@ms-ati

ms-ati commented Oct 5, 2015

Would it also make sense to put a CDN in front, configured to successfully return the last value when the origin is unreachable? This would seem a good case for that.


@espadrine
Member

@tankerkiller125 Shields.io is not CPU-bound.

@ms-ati What is the difference between a CDN that returns the last value when the origin is unreachable, and a cache that does the same thing? We currently have the latter.

@ms-ati

ms-ati commented Oct 5, 2015

> What is the difference between a CDN that returns the last value when the origin is unreachable, and a cache that does the same thing? We currently have the latter.

@espadrine Good question! I believe the difference is in the area of robustness. Or downtime, like this ticket discusses.

Using a cache as we have today helps performance, and it may provide robustness against downtime of the data sources. However, as this ticket attests, it doesn't provide robustness for the shields.io service endpoint itself.

I think that using a CDN "in front" of shields.io, configured to return successfully with the last value for any URL it has cached when the origin is down, would provide robustness against the service itself being down.

That way, badges that have previously been requested will continue to appear, and just won't be updated until the service is back up.
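
For reference, RFC 5861 defines a Cache-Control extension that expresses exactly this behaviour; a cache or CDN that honours it (not all do) can keep serving the last good badge during an origin outage. The values below are purely illustrative:

```
Cache-Control: public, max-age=300, stale-if-error=86400
```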

@tankerkiller125

@hotrush I think the load the servers are under is causing the requests to the vendor APIs to complete too slowly for the software, which causes the "vendor unresponsive" errors.

@espadrine
Member

Note: I changed to having two servers with DNS round-robin. We'll see if there is some improvement.
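
For readers unfamiliar with the setup: DNS round-robin simply means publishing several A records for the same name, so resolvers rotate between the server IPs. Illustrative records only (the addresses below are documentation addresses, not the real ones):

```
img.shields.io.   300   IN   A   192.0.2.10
img.shields.io.   300   IN   A   192.0.2.20
```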

@patrikhuber

My badges still don't load (see my post & repo link above); do you have any idea what I could try?

@pkoretic

@patrikhuber @espadrine

I have the same problem. It seems the GitHub link for the license badge inside the README hasn't worked for some time now; replacing it with a direct link works:

[![License](http://img.shields.io/:license-mit-blue.svg)](http...)

@patrikhuber

I just checked and was very surprised to see the license and version badges load a few minutes ago. I refreshed a few times and it was working each time.

Now, 3 minutes or so later, the release badge is back to not loading.

@espadrine
Member

@patrikhuber That's probably GitHub's rate limit kicking in on one server. I suggest using https://img.shields.io/badge/license-mit-blue.svg?style=flat-square instead. That will load essentially instantaneously.

@algorys

algorys commented Feb 22, 2016

Hello,

I have the same issue with my badges.

The Travis badge works well; the others (the GitHub license and release badges) sometimes appear, sometimes not.

https://github.com/algorys/agshmne/

@espadrine I've tried your solution. It works great, but it may be too complicated to maintain for a large project.

@patrikhuber

@espadrine Thanks. Is there a workaround/solution for the release version badges?

Are all shields users experiencing these problems, and if not, why only a handful of us? Why does this rate limit not kick in for others?

@algorys

algorys commented Feb 25, 2016

@patrikhuber you can do the same with your releases, but you have to change it manually for each version/tag:

[![GitHub release](https://img.shields.io/badge/release-v0.0.3-blue.svg)](https://github.com/algorys/agshmne/releases/latest)

But as I said before, if you release often it'll be hard to maintain.

@espadrine
Member

@patrikhuber I'm in contact with GitHub to figure out a solution. The main issue is here: #529. It does affect everyone, although we do have two servers, so two IPs, presumably treated separately.

@patrikhuber

@algorys I know, but that's not really a great option unfortunately.

@espadrine Cool, that's great to hear! Awesome. And thanks for the link to #529. It just seems a bit disappointing that the conversation hasn't advanced since Sept 2015. Anyway, glad to see I'm not the only one having this issue, and I hope there will be a solution soon.

@marcelstoer

I've just gone through the comments in this issue and those in its sibling #529. As a result I also contributed tokens using https://img.shields.io/github-auth.

What I don't understand is why GitHub would replace img src URLs from https://img.shields.io/ with ones from https://camo.githubusercontent.com/ at all. After all, it's the browser making those requests, which would thus hit shields.io directly. Why would GitHub want to redirect (and cache) those? To me that makes particularly little sense for "static" badges (i.e. those that don't need to access GitHub resources to render) such as license or Twitter.

@espadrine
Member

@marcelstoer Originally, the point was to avoid the mixed-content warnings which browsers raise when an HTTPS page has resources (e.g. images) fetching data over insecure HTTP. You can read more on the subject in their README.

@marcelstoer

marcelstoer commented Jan 20, 2017

Ahh, right... "mixed-content warnings", forgot about those - except that shields.io can be accessed over HTTPS as well.

@espadrine
Member

@marcelstoer Camo is now also used to set CSP information, which they rely on to avoid a class of vulnerabilities such as XSS and CSRF.

@marcelstoer

Would self-hosting be a sensible alternative to all of us loading from https://img.shields.io?

@pkoretic

@marcelstoer After too much trouble with img.shields.io (it's just not available most of the time), I've put my static badges in a separate branch and linked them through the rawgit service.
I can confirm it works without issues; example: https://github.com/qaap/recurse

@marcelstoer

@pkoretic I had already planned to do that for static badges as well. The majority of badges are dynamic, though (i.e. they make API calls).

@pkoretic

@marcelstoer Yeah, I've ignored them for now; probably the best option is to add a reverse caching proxy in front of them for your own usage.

@marcelstoer

I believe there's a misunderstanding, either on your side or on mine 😉 What I meant was to host this project myself at, let's say, https://shields.mydomain.io. Then in my READMEs I'd use that domain. GitHub would still route them through Camo, but those requests wouldn't count against the https://img.shields.io rate limit.

@pkoretic

@marcelstoer I understand, but it seems like too much work for just badges. It would be better if we could contribute more servers to the pool, since I too have spare dedicated servers that could take on that load.

That's why I recommended a reverse caching proxy (using nginx, for example).

That way you can still use your own domain, https://shields.mydomain.io, as a proxy to https://img.shields.io, so when https://img.shields.io is not available you still get cached results instead of a hanging request.
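
A minimal sketch of such a configuration, assuming nginx; the domain, cache path, and sizes are placeholders, and proxy_cache_use_stale is what serves the last cached badge when the upstream is down:

```nginx
# Illustrative reverse caching proxy in front of img.shields.io.
# Paths, sizes, and the server_name are placeholders.
proxy_cache_path /var/cache/nginx/shields levels=1:2 keys_zone=shields:10m
                 max_size=1g inactive=7d;

server {
    listen 80;
    server_name shields.mydomain.io;   # placeholder domain from the comment above

    location / {
        proxy_pass https://img.shields.io;
        proxy_set_header Host img.shields.io;
        proxy_ssl_server_name on;      # send SNI to the HTTPS upstream
        proxy_cache shields;
        proxy_cache_valid 200 5m;
        # Serve the last cached badge if the upstream errors out or times out.
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}
```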

@espadrine
Member

The load has increased quite a bit lately: it averages 170 req/s at peak time. I'll add a server to the pool (I currently rely on two servers).

GitHub rate limits are not the issue anymore.

@tankerkiller125

@espadrine There is a module for nginx that allows the load-balancing pool to be controlled through a REST interface, and you could run an HTTP health check against the servers every 20 seconds or so. This would let people with additional servers help handle the load, with minimal configuration on your end. Of course, it would need some authentication so that servers can be added and removed securely.

paulmelnikow added the operations label Apr 17, 2017
@paulmelnikow
Member

@espadrine Is there any action to take here?

@espadrine
Member

@paulmelnikow I believe the performance of the servers is more suitable nowadays.

badges locked and limited conversation to collaborators Sep 6, 2017