-
Notifications
You must be signed in to change notification settings - Fork 479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the readiness checks for functions #249
Conversation
Thank you for your contribution. I've just checked and your commit doesn't appear to be signed-off. |
02fc6ee
to
c19eefa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Berndt thanks for your patch, just a few tweaks needed before merge. If making the probe type is more work than you have time for maybe it can be done in a follow-up PR? Thanks, Alex
handlers/deploy.go
Outdated
probe := &apiv1.Probe{ | ||
Handler: apiv1.Handler{ | ||
Exec: &apiv1.ExecAction{ | ||
Command: []string{"cat", path}, | ||
HTTPGet: &apiv1.HTTPGetAction{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't be turned on by default, it needs to be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I had thought based on our off-line conversation this wasn't the case. Easy to make optional
handlers/deploy.go
Outdated
}, | ||
}, | ||
InitialDelaySeconds: 3, | ||
InitialDelaySeconds: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please introduce a configuration item for this. You can largely copy and paste from the existing variables.
Should also be available via helm as an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
handlers/deploy.go
Outdated
TimeoutSeconds: 1, | ||
PeriodSeconds: 10, | ||
PeriodSeconds: 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be a configuration item with a default of the previous value for compatibility. When used in dispatch you'd just set your values via the helm chart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you want the default to be the previous value(s)? These values will have far more positive effect than negative, and shouldn't break anything existing
Yes @alexellis I assumed any change to the actual probe, would be a separate PR |
a5214a5
to
e33512a
Compare
Pretty much exposed everything. Let me know if you think this is going a bit far. Also, I left the default for the liveness probe the same as before, but the readiness probe has new values which make 0->1 scaling faster. |
Hi @berndtj that is very thorough work, thanks for taking time to think through the configuration options and for signing-off the PR. Here is what I was thinking: Since we use the same-point for liveness and readiness, they should always be enabled, but the question is which mode. Compatibility mode or http-mode?
Both are needed to keep compatibility with existing functions, that's why the option is needed. Given a value of
https://github.com/openfaas/faas/blob/master/watchdog/main.go#L53 faas-netes and the gateway expose health via If you are not using the watchdog and don't want to expose your health endpoint via Given a value of I think we could de-duplicate the options in the PR and use the same values for timeout / period checking and initial check for both liveness and readiness. At this stage they point at the same endpoint and react in the same way. What are your thoughts on above? |
That's kind of embarrassing (/healthz vs /_/health). I'm surprised it still passes readiness/liveness. I'm not sure the value needs to be configurable necessarily. Anyway on to your other points. Yeah, I forgot about "compatiblity" mode, that's easy enough. I explicitly did not dedupe the probes as I figured you actually want different values for liveness and readiness even if the endpoint is the same. For instance, I can live with a much longer period with liveness, but I want as short as possible for readiness. |
e33512a
to
8e9ceb8
Compare
Ok, updated based on comments. Probe is always on and defaults to http, but can be configured for lock. Also did a bit of deduping of code where applicable. Lastly... I see errors occasionally with regards to cold start/first call:
I don't believe this has anything to do with this change (we've seen the same error within Dispatch when using OpenFaaS). It's something that should probably be addressed separately. |
@berndtj on the last comment I have a question. When do you see that issue? Is it specifically when scaling 0 to 1 or at other times? |
Yes, I only see it scaling from 0->1 (but to be honest I'm not testing subsequent requests). We've also seen it with Dispatch and openfaas when we are waiting on the function to become ready. It's likely the same issue. I actually have a change to the actual readiness check that doesn't rely on the lock file at all. I'll give that a test. I think it actually does fix the issue. |
@@ -113,6 +113,20 @@ spec: | |||
value: "{{ .Values.faasnetesd.writeTimeout }}" | |||
- name: image_pull_policy | |||
value: {{ .Values.faasnetesd.imagePullPolicy | quote }} | |||
- name: http_probe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, these will need to be added to the README.md in the chart to show what values are valid and what they mean.
I think we should have defaults over there.
We also need to deploy via plain YAML via the ./yaml/ folder, so I imagine this needs updating too? That or sane (existing) defaults have to be added to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just seen the defaults in the code, if the defaults work well then we could update the YAML later.) Best way to test is to kubectl delete
the two OpenFaaS namespaces, then apply the YAML folder again.
types/read_config.go
Outdated
WriteTimeout time.Duration | ||
ImagePullPolicy string | ||
Port int | ||
HTTPProbe bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this might be useful to comment on:
// HTTPProbe when set to true switches readiness and liveness probe to access /_/health over HTTP instead of accessing /tmp/.lock.
types/read_config.go
Outdated
HTTPProbe bool | ||
ReadinessProbeInitialDelaySeconds int | ||
ReadinessProbeTimeoutSeconds int | ||
ReadinessProbePeriodSeconds int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious if this is worth making a Golang duration in this PR or a follow-up?
The other configs support Golang durations now, could call durationVal.Seconds()
in the code to convert if that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't consider this as compulsory - just want your take on it.
I hadn't even considered the yaml ;). I'll make sure and test first |
Handler: apiv1.Handler{ | ||
var handler apiv1.Handler | ||
|
||
if config.HTTPProbe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for now but in the future we should have a way to switch this flag from the function definition so that is backwards compatible with functions that are built with the old watchdog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, we could use the new annotations field being worked on by @ewilde
ReadinessProbePeriodSeconds int | ||
LivenessProbeInitialDelaySeconds int | ||
LivenessProbeTimeoutSeconds int | ||
LivenessProbePeriodSeconds int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make these of type time.Duration
but we can address this at a later time.
8e9ceb8
to
ed8b6d3
Compare
Not going to be popular for saying this, but we've had some Chart changes merged since the PR. This generally means resetting the commit, rebasing the chart then running make charts again before doing a commit with a force. Other than that LGTM. Alex |
Hi Berndt I know you have time away coming up, all I could do at this point is to take your commit, reset it, fix it and add it back again but it would lose your authorship. I could perhaps set the "git author" but it won't look like it does now in the history. Alex |
I can fix it up right now. |
* Significantly improves scale up time for functions (when going from 0 -> 1) * Health check is hit more frequently, but should not noticibly impact performance * Use the httpget probe type and leverage the watchdog /healthz endpoint * Make all probe attributes configurable in charts This could be optimized a little further if new image for doing the http probes where created which would block on connection errors and return immediately when the response comes back, but the best case is < 1s improvement. Some performance numbers. Before (there was a timeout error): cold start: 10.240251064300537 error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1. cold start: 4.621361255645752 cold start: 5.6364970207214355 cold start: 11.648431777954102 cold start: 8.450724840164185 cold start: 9.854270935058594 cold start: 12.048357009887695 cold start: 12.24026870727539 After: cold start: 1.8590199947357178 cold start: 1.8544681072235107 cold start: 2.065181016921997 cold start: 1.8414137363433838 cold start: 1.6598482131958008 cold start: 2.4577977657318115 cold start: 2.4510068893432617 cold start: 2.244048833847046 cold start: 2.6444039344787598 Signed-off-by: Berndt Jung <bjung@vmware.com>
ed8b6d3
to
d451c1e
Compare
The following commit did not update tests and it seems the Dockerfile / CI was not running them either, found by Lucas. Error in: aa04e3e Tested with: - go test ./test - make Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
We just discovered the tests were broken in this commit, fixed in #249. |
The following commit did not update tests and it seems the Dockerfile / CI was not running them either, found by Lucas. Error in: aa04e3e Tested with: - go test ./test - make Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
Hi @berndtj, Do you have plans to add /_/health endpoint in https://github.com/openfaas-incubator/of-watchdog functions? |
This is the wrong repo for the question. There's already a disk-based health check with s http one in progress - openfaas/of-watchdog#13 |
from 0 -> 1)
impact performance
endpoint
This could be optimized a little further if new image for doing
the http probes where created which would block on connection errors
and return immediately when the response comes back, but the best
case is < 1s improvement.
Some performance numbers. Before (there was a timeout error):
cold start: 10.240251064300537
error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1.
cold start: 4.621361255645752
cold start: 5.6364970207214355
cold start: 11.648431777954102
cold start: 8.450724840164185
cold start: 9.854270935058594
cold start: 12.048357009887695
cold start: 12.24026870727539
After:
cold start: 1.8590199947357178
cold start: 1.8544681072235107
cold start: 2.065181016921997
cold start: 1.8414137363433838
cold start: 1.6598482131958008
cold start: 2.4577977657318115
cold start: 2.4510068893432617
cold start: 2.244048833847046
cold start: 2.6444039344787598
Description
Motivation and Context
Fix #218
How Has This Been Tested?
Types of changes
This only breaks very old functions which use a version of the watchdog which does not have a /healtz endpoint
Checklist:
git commit -s