
Autoscaled instances registered before active health checks take effect #16

Closed
adybuxton opened this issue Jan 18, 2019 · 3 comments

@adybuxton

We have an issue where instances are being added to the upstream lists before they are actually ready. The active NGINX health checks, which run every 5 seconds, are ignored for roughly the first 30 seconds after the instance has been added. This results in the services timing out for about 30 seconds, until the active health check marks the new instance as 'down', while Chef installs the services.

We've noticed that this solution also adds auto-scaled instances regardless of their state. So setting a lifecycle hook that marks the instance as initially Pending and then lets it time out into the InService state after several minutes (while the services are provisioning in the background) still results in the new instance being added immediately when it spins up.

Modifying the sync interval doesn't make any difference in this scenario, as an instance could be added towards the end of the sync interval and still show up in the upstream list immediately.

Is this a limitation of the service, or are there other approaches to mitigate it? Is there a reason why there is such a large delay before active health checks mark the services as down?

@pleshakov added the question label on Jan 18, 2019
@pleshakov

@adybuxton
Health checks take into account the connection and other timeouts related to establishing a connection and reading/sending a response from/to a backend instance. Even if the health check interval is 5 seconds, when the value of a timeout is bigger than 5s, the first health check for an unavailable instance will not fail until that timeout expires. So, to make sure a health check fails fast, you can decrease the values of the connection and other timeouts. For example:

        proxy_connect_timeout 5s;
        proxy_read_timeout 5s;
        proxy_send_timeout 5s;

Additionally, you can tell NGINX Plus not to consider a newly added instance healthy until its first health check passes. For this, use the mandatory parameter when defining a health check -- http://nginx.org/en/docs/http/ngx_http_upstream_hc_module.html#health_check

Here is an example that uses low timeout values and the mandatory parameter. Please note that the health checks can be put into a separate internal location for convenience:

upstream webapp1 {
    zone webapp1 64k;
    state /var/lib/nginx/state/webapp1.conf;
}

server {
    location /webapp1 {
        proxy_pass http://webapp1;
    }

    # Separate location used only for the health checks, so the low
    # timeouts do not affect regular client traffic.
    location @hc-webapp1 {
        internal;
        proxy_connect_timeout 1s;
        proxy_read_timeout 1s;
        proxy_send_timeout 1s;

        proxy_pass http://webapp1;
        health_check interval=1s mandatory;
    }
}

Is this a limitation of the service, or are there other approaches to mitigate it? Is there a reason why there is such a large delay before active health checks mark the services as down?

Yes, nginx-asg-sync doesn't take the state of the instances of an Auto Scaling group into consideration. However, to make sure that NGINX Plus never starts using an unhealthy instance, you can use mandatory health checks as described above.

@adybuxton

Thanks, I'll take a look. One thing that may add additional value and flexibility is allowing instances with specific lifecycle hook states to be filtered out of the returned instance list, for example Pending:Wait, to cover situations where the instance is told to wait for a finite time (e.g. for provisioning) before it transitions into the InService state.
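
For illustration only, here is a minimal sketch of how that kind of filtering could look using the AWS SDK for Go. This is not the actual nginx-asg-sync code, nor what was later implemented in #39; the Auto Scaling group name is hypothetical:

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/autoscaling"
    )

    // inServiceInstanceIDs returns only the instances of the given Auto Scaling
    // group whose lifecycle state is "InService", skipping states such as
    // "Pending:Wait" that indicate the instance is still being provisioned.
    func inServiceInstanceIDs(svc *autoscaling.AutoScaling, groupName string) ([]string, error) {
        out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
            AutoScalingGroupNames: []*string{aws.String(groupName)},
        })
        if err != nil {
            return nil, err
        }
        var ids []string
        for _, group := range out.AutoScalingGroups {
            for _, inst := range group.Instances {
                if aws.StringValue(inst.LifecycleState) == "InService" {
                    ids = append(ids, aws.StringValue(inst.InstanceId))
                }
            }
        }
        return ids, nil
    }

    func main() {
        sess := session.Must(session.NewSession())
        svc := autoscaling.New(sess)

        // "webapp1-asg" is a hypothetical group name used for illustration.
        ids, err := inServiceInstanceIDs(svc, "webapp1-asg")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(ids)
    }

If only the instances returned by such a filter were pushed into the upstream state, an instance held in Pending:Wait by a lifecycle hook would not receive traffic until it transitions to InService.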

@pleshakov

implemented in #39
