Reduce the readiness checks for functions #249

berndtj · 2018-07-16T21:58:42Z

* Significantly improves scale-up time for functions (when going from 0 -> 1)
* Health check is hit more frequently, but should not noticeably impact performance
* Use the httpget probe type and leverage the watchdog /healthz endpoint

This could be optimized a little further if a new image for doing
the http probes were created which would block on connection errors
and return immediately when the response comes back, but the best
case is < 1s improvement.

Some performance numbers. Before (there was a timeout error):

cold start: 10.240251064300537
error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1.
cold start: 4.621361255645752
cold start: 5.6364970207214355
cold start: 11.648431777954102
cold start: 8.450724840164185
cold start: 9.854270935058594
cold start: 12.048357009887695
cold start: 12.24026870727539

After:

cold start: 1.8590199947357178
cold start: 1.8544681072235107
cold start: 2.065181016921997
cold start: 1.8414137363433838
cold start: 1.6598482131958008
cold start: 2.4577977657318115
cold start: 2.4510068893432617
cold start: 2.244048833847046
cold start: 2.6444039344787598
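
As a quick summary of the two runs, the listed timings can be averaged. This is a small sketch and not part of the PR; the variable names are illustrative, and the failed invocation in the first run is left out because it has no timing.

```python
from statistics import mean

# Cold-start timings in seconds, copied from the two runs listed above.
# The failed invocation in the "before" run has no timing and is omitted.
before = [
    10.240251064300537, 4.621361255645752, 5.6364970207214355,
    11.648431777954102, 8.450724840164185, 9.854270935058594,
    12.048357009887695, 12.24026870727539,
]
after = [
    1.8590199947357178, 1.8544681072235107, 2.065181016921997,
    1.8414137363433838, 1.6598482131958008, 2.4577977657318115,
    2.4510068893432617, 2.244048833847046, 2.6444039344787598,
]

mean_before = mean(before)
mean_after = mean(after)

print("before: mean %.2fs, max %.2fs" % (mean_before, max(before)))
print("after:  mean %.2fs, max %.2fs" % (mean_after, max(after)))
```

On these samples the mean drops from roughly 9.3 s to roughly 2.1 s, and the slowest run after the change is faster than the fastest run before it.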

Description

Motivation and Context

Fix #218

I have raised an issue to propose this change (required)

How Has This Been Tested?

import requests
import subprocess
import time

for i in range(10):
    # Time one invocation while the function is scaled to zero (cold start).
    now = time.time()
    try:
        subprocess.check_output('echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python', shell=True)
        print("cold start: %s" % (time.time() - now))
    except Exception as e:
        print("error calling function: %s" % e)
    # Scale the function back to zero replicas, then wait before the next run.
    requests.post("http://192.168.64.78:31113/system/scale-function/hello-python", json={"serviceName": "hello-python", "replicas": 0})
    time.sleep(10)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

This only breaks very old functions that use a version of the watchdog which does not have a /healthz endpoint.

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I've read the CONTRIBUTION guide
I have signed-off my commits with git commit -s
I have added tests to cover my changes.
All new and existing tests passed.

derek · 2018-07-16T21:58:45Z

Thank you for your contribution. I've just checked and your commit doesn't appear to be signed-off.
That's something we need before your Pull Request can be merged. Please see our contributing guide.

alexellis

Hi Berndt thanks for your patch, just a few tweaks needed before merge. If making the probe type is more work than you have time for maybe it can be done in a follow-up PR? Thanks, Alex

alexellis · 2018-07-16T22:00:41Z

handlers/deploy.go

 	probe := &apiv1.Probe{
 		Handler: apiv1.Handler{
-			Exec: &apiv1.ExecAction{
-				Command: []string{"cat", path},
+			HTTPGet: &apiv1.HTTPGetAction{


This can't be turned on by default, it needs to be optional.

Oh, I had thought based on our off-line conversation this wasn't the case. Easy to make optional

alexellis · 2018-07-16T22:01:33Z

handlers/deploy.go

 			},
 		},
-		InitialDelaySeconds: 3,
+		InitialDelaySeconds: 0,


Please introduce a configuration item for this. You can largely copy and paste from the existing variables.

Should also be available via helm as an option.

alexellis · 2018-07-16T22:02:41Z

handlers/deploy.go

 		TimeoutSeconds:      1,
-		PeriodSeconds:       10,
+		PeriodSeconds:       1,


This should also be a configuration item with a default of the previous value for compatibility. When used in Dispatch you'd just set your values via the helm chart.

Are you sure you want the default to be the previous value(s)? These values will have far more positive effect than negative, and shouldn't break anything existing

berndtj · 2018-07-16T22:17:22Z

Yes @alexellis, I assumed any change to the actual probe would be a separate PR.

berndtj · 2018-07-16T23:46:07Z

Pretty much exposed everything. Let me know if you think this is going a bit far. Also, I left the default for the liveness probe the same as before, but the readiness probe has new values which make 0->1 scaling faster.
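
To illustrate why a shorter readiness period speeds up 0->1 scaling, here is a simplified timing model. It is a sketch, not code from this PR: it assumes the first probe runs initialDelaySeconds after the container starts and then once every periodSeconds, and it ignores scheduling, image pulls, probe jitter and the probe's own runtime. The function name first_passing_probe is hypothetical; the values 3/10 and 0/1 are the old and new InitialDelaySeconds/PeriodSeconds shown in the review diff.

```python
import math


def first_passing_probe(ready_at, initial_delay, period):
    """Return the time (seconds after container start) of the first probe
    that runs at or after the moment the function is actually ready.

    Simplified model: probes fire at initial_delay, initial_delay + period,
    initial_delay + 2 * period, and so on.
    """
    if ready_at <= initial_delay:
        return initial_delay
    # Number of whole periods needed to reach or pass ready_at.
    ticks = math.ceil((ready_at - initial_delay) / period)
    return initial_delay + ticks * period


# A function that becomes ready 3.5 s after its container starts.
old = first_passing_probe(3.5, initial_delay=3, period=10)
new = first_passing_probe(3.5, initial_delay=0, period=1)
print("old settings: marked ready at %ss" % old)
print("new settings: marked ready at %ss" % new)
```

Under this model, a function that just misses the first probe with the old settings waits a full 10 s period for the next one, while with a 1 s period the wait is at most about a second.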

alexellis · 2018-07-17T10:18:09Z

Hi @berndtj that is very thorough work, thanks for taking time to think through the configuration options and for signing-off the PR.

Here is what I was thinking:

Since we use the same endpoint for liveness and readiness, they should always be enabled, but the question is which mode. Compatibility mode or http-mode?

   probe_type: http

   probe_type: lock

Both are needed to keep compatibility with existing functions, that's why the option is needed.

Given a value of http then the /_/health endpoint should be queried (as defined in the watchdog). The OpenFaaS watchdog uses a prefix to avoid any clashing of function endpoints:

/_/health

https://github.com/openfaas/faas/blob/master/watchdog/main.go#L53

faas-netes and the gateway expose health via /healthz because they are not functions, but services.

If you are not using the watchdog and don't want to expose your health endpoint via /_/health then perhaps this should be configurable in the helm chart, for your use only?

Given a value of lock then the existing code should still run and the http probe should not be added.
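
The two modes could be sketched roughly as follows. This is illustrative Python, not the Go code in handlers/deploy.go: the function name build_probe and its parameters are hypothetical, port 8080 is an assumption based on the upstream function URL quoted elsewhere in this thread, and /tmp/.lock is the lock file path mentioned in the review. The dictionary keys follow the Kubernetes probe fields (httpGet, exec, initialDelaySeconds, timeoutSeconds, periodSeconds), and the defaults are the previous values from the diff.

```python
def build_probe(probe_type, initial_delay=3, timeout=1, period=10):
    """Return a Kubernetes-style probe spec (as a dict) for the given mode.

    probe_type "http" queries the watchdog's /_/health endpoint;
    probe_type "lock" keeps the existing behaviour of reading the lock file.
    """
    if probe_type == "http":
        # Port 8080 is an assumption for this sketch.
        handler = {"httpGet": {"path": "/_/health", "port": 8080}}
    elif probe_type == "lock":
        handler = {"exec": {"command": ["cat", "/tmp/.lock"]}}
    else:
        raise ValueError("unknown probe_type: %r" % probe_type)

    probe = dict(handler)
    probe.update({
        "initialDelaySeconds": initial_delay,
        "timeoutSeconds": timeout,
        "periodSeconds": period,
    })
    return probe


# Faster readiness checking in http mode, unchanged defaults in lock mode.
print(build_probe("http", initial_delay=0, period=1))
print(build_probe("lock"))
```

Rejecting unknown values keeps a typo in the chart from silently falling back to one of the two modes.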

I think we could de-duplicate the options in the PR and use the same values for timeout / period checking and initial check for both liveness and readiness. At this stage they point at the same endpoint and react in the same way.

What are your thoughts on above?

berndtj · 2018-07-17T16:43:46Z

That's kind of embarrassing (/healthz vs /_/health). I'm surprised it still passes readiness/liveness. I'm not sure the value needs to be configurable necessarily.

Anyway, on to your other points. Yeah, I forgot about "compatibility" mode, that's easy enough.

I explicitly did not dedupe the probes as I figured you actually want different values for liveness and readiness even if the endpoint is the same. For instance, I can live with a much longer period with liveness, but I want as short as possible for readiness.

berndtj · 2018-07-18T00:24:15Z

Ok, updated based on comments. Probe is always on and defaults to http, but can be configured for lock. Also did a bit of deduping of code where applicable.

Lastly... I see errors occasionally with regards to cold start/first call:

2018/07/18 00:16:00 error with upstream request to: /function/hello-python, Post http://hello-python.openfaas-fn.svc.cluster.local.:8080/function/hello-python: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I don't believe this has anything to do with this change (we've seen the same error within Dispatch when using OpenFaaS). It's something that should probably be addressed separately.

alexellis · 2018-07-19T14:28:13Z

@berndtj on the last comment I have a question.

When do you see that issue? Is it specifically when scaling 0 to 1 or at other times?

berndtj · 2018-07-19T16:19:10Z

Yes, I only see it scaling from 0->1 (but to be honest I'm not testing subsequent requests). We've also seen it with Dispatch and openfaas when we are waiting on the function to become ready. It's likely the same issue.

I actually have a change to the actual readiness check that doesn't rely on the lock file at all. I'll give that a test. I think it actually does fix the issue.

alexellis · 2018-07-19T16:27:38Z

chart/openfaas/templates/gateway-dep.yaml

@@ -113,6 +113,20 @@ spec:
          value: "{{ .Values.faasnetesd.writeTimeout }}"
        - name: image_pull_policy
          value: {{ .Values.faasnetesd.imagePullPolicy | quote }}
+        - name: http_probe


Hi, these will need to be added to the README.md in the chart to show what values are valid and what they mean.

I think we should have defaults over there.

We also need to deploy via plain YAML via the ./yaml/ folder, so I imagine this needs updating too? That or sane (existing) defaults have to be added to the code.

(Just seen the defaults in the code, if the defaults work well then we could update the YAML later.) Best way to test is to kubectl delete the two OpenFaaS namespaces, then apply the YAML folder again.

alexellis · 2018-07-19T16:30:57Z

types/read_config.go

-	WriteTimeout                 time.Duration
-	ImagePullPolicy              string
-	Port                         int
+	HTTPProbe                         bool


Think this might be useful to comment on:

// HTTPProbe when set to true switches readiness and liveness probe to access /_/health over HTTP instead of accessing /tmp/.lock.

alexellis · 2018-07-19T16:31:46Z

types/read_config.go

+	HTTPProbe                         bool
+	ReadinessProbeInitialDelaySeconds int
+	ReadinessProbeTimeoutSeconds      int
+	ReadinessProbePeriodSeconds       int


Curious if this is worth making a Golang duration in this PR or a follow-up?

The other configs support Golang durations now, could call durationVal.Seconds() in the code to convert if that makes sense.

I don't consider this as compulsory - just want your take on it.

berndtj · 2018-07-19T16:35:11Z

I hadn't even considered the yaml ;). I'll make sure and test first

stefanprodan · 2018-07-19T16:58:24Z

handlers/deploy.go

-		Handler: apiv1.Handler{
+	var handler apiv1.Handler
+
+	if config.HTTPProbe {


This looks good for now, but in the future we should have a way to switch this flag from the function definition so that it is backwards compatible with functions that are built with the old watchdog.

For instance, we could use the new annotations field being worked on by @ewilde

stefanprodan · 2018-07-19T16:59:49Z

types/read_config.go

+	ReadinessProbePeriodSeconds       int
+	LivenessProbeInitialDelaySeconds  int
+	LivenessProbeTimeoutSeconds       int
+	LivenessProbePeriodSeconds        int


I would make these of type time.Duration but we can address this at a later time.

alexellis · 2018-07-20T14:52:45Z

Not going to be popular for saying this, but we've had some Chart changes merged since the PR.

This generally means resetting the commit, rebasing the chart then running make charts again before doing a commit with a force.

Other than that LGTM.

Alex

alexellis · 2018-07-21T08:34:53Z

Hi Berndt I know you have time away coming up, all I could do at this point is to take your commit, reset it, fix it and add it back again but it would lose your authorship. I could perhaps set the "git author" but it won't look like it does now in the history.

Alex

berndtj · 2018-07-21T15:09:23Z

I can fix it up right now.

* Significantly improves scale up time for functions (when going from 0 -> 1)
* Health check is hit more frequently, but should not noticeably impact performance
* Use the httpget probe type and leverage the watchdog /healthz endpoint
* Make all probe attributes configurable in charts

This could be optimized a little further if a new image for doing the http probes were created which would block on connection errors and return immediately when the response comes back, but the best case is < 1s improvement.

Some performance numbers. Before (there was a timeout error):

cold start: 10.240251064300537
error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1.
cold start: 4.621361255645752
cold start: 5.6364970207214355
cold start: 11.648431777954102
cold start: 8.450724840164185
cold start: 9.854270935058594
cold start: 12.048357009887695
cold start: 12.24026870727539

After:

cold start: 1.8590199947357178
cold start: 1.8544681072235107
cold start: 2.065181016921997
cold start: 1.8414137363433838
cold start: 1.6598482131958008
cold start: 2.4577977657318115
cold start: 2.4510068893432617
cold start: 2.244048833847046
cold start: 2.6444039344787598

Signed-off-by: Berndt Jung <bjung@vmware.com>

The following commit did not update tests and it seems the Dockerfile / CI was not running them either, found by Lucas.

Error in: aa04e3e

Tested with:
- go test ./test
- make

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>

alexellis · 2018-08-03T08:30:29Z

We just discovered the tests were broken in this commit, fixed in #249.

dkozlov · 2018-08-05T18:18:16Z

Hi @berndtj, Do you have plans to add /_/health endpoint in https://github.com/openfaas-incubator/of-watchdog functions?

alexellis · 2018-08-05T18:25:50Z

This is the wrong repo for the question. There's already a disk-based health check with an HTTP one in progress - openfaas/of-watchdog#13

derek bot added the new-contributor label Jul 16, 2018

derek bot added the no-dco label Jul 16, 2018

berndtj force-pushed the reduce-readiness branch from 02fc6ee to c19eefa, July 16, 2018 22:03

derek bot removed the no-dco label Jul 16, 2018

alexellis requested changes Jul 16, 2018

berndtj force-pushed the reduce-readiness branch 2 times, most recently from a5214a5 to e33512a, July 16, 2018 23:44

berndtj force-pushed the reduce-readiness branch from e33512a to 8e9ceb8, July 18, 2018 00:20

alexellis reviewed Jul 19, 2018

stefanprodan approved these changes Jul 19, 2018

berndtj force-pushed the reduce-readiness branch from 8e9ceb8 to ed8b6d3, July 19, 2018 18:16

berndtj force-pushed the reduce-readiness branch from ed8b6d3 to d451c1e, July 21, 2018 15:13

alexellis merged commit aa04e3e into openfaas:master Jul 21, 2018
