child: use 127.0.0.1 for httpGet probes if hostNetwork field is true #434

Closed
wants to merge 1 commit

Conversation

ilyam8
Member

@ilyam8 ilyam8 commented Aug 7, 2024

As suggested in https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes

Suppose the container listens on 127.0.0.1 and the Pod's hostNetwork field is true. Then host, under httpGet, should be set to 127.0.0.1.
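For reference, the rendered probe spec would then look roughly like this (a minimal sketch; path, port, and scheme mirror the chart's current httpGet probe, and host is the field this PR adds):

livenessProbe:
  httpGet:
    path: /api/v1/info
    port: http
    scheme: HTTP
    host: 127.0.0.1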

cc @BirknerAlex

@ilyam8 ilyam8 requested review from a team as code owners August 7, 2024 15:34
@ilyam8
Member Author

ilyam8 commented Aug 7, 2024

Hey, @BirknerAlex. Do we really need httpGet.host?

@BirknerAlex
Contributor

BirknerAlex commented Aug 7, 2024

In my case I kept hostNetwork: true because I thought it was needed to collect network interface stats or something like that (monitoring stuff).

I only bound the web server to 127.0.0.1 instead, so the Pod IP is still the node IP (public), which is used for the probes as well. So if you ask me, it's required to override it.
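For context, binding the Netdata web server to loopback is done in netdata.conf; a minimal sketch (the [web] section and "bind to" option, values illustrative):

[web]
    # listen on loopback only, so probes against the Pod/node IP will fail
    bind to = 127.0.0.1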

@ilyam8
Member Author

ilyam8 commented Aug 7, 2024

In my case I kept hostNetwork: true because I thought it was needed to collect network interface stats or something like that (monitoring stuff)

Yes. It is needed for:

  • Host system networking stack monitoring.
  • Monitoring applications running on the host (e.g. Kubelet).
  • Discovering all current network sockets and building a network-map (Top->Network connections).

I mean, if we just do the following:

{{- if .Values.child.hostNetwork }}
  host: 127.0.0.1
{{- end }}

That will do, right? Is there a case where hostNetwork is false and you still need to set host? I am thinking of removing the httpGet.host value entirely and just setting host to 127.0.0.1 when hostNetwork is true.
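For illustration, embedded in the probe template that might look like this (a sketch; the httpGet fields mirror the chart's current defaults, and .Values.child.hostNetwork is the value from the snippet above):

livenessProbe:
  httpGet:
    path: /api/v1/info
    port: http
    scheme: HTTP
    {{- if .Values.child.hostNetwork }}
    host: 127.0.0.1
    {{- end }}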

@BirknerAlex
Contributor

BirknerAlex commented Aug 7, 2024

Yes, you're right. That should be fine instead of the additional value.

Edit: Okay, I might have found one reason to keep it, though not in my use case: with hostNetwork: true and the web server bound to a specific additional IP of the host (e.g. if the node has multiple IPs, one internal and one external), you might want to set that IP instead. But sure, you could also additionally bind it to 127.0.0.1 as well. That would only work on a single-node cluster, though.

@witalisoft
Contributor

witalisoft commented Aug 12, 2024

What about changing the readiness/liveness probes to the exec type and simplifying it?

-      livenessProbe:
-        failureThreshold: 3
-        httpGet:
-          path: /api/v1/info
-          port: http
-          scheme: HTTP
-        periodSeconds: 30
-        successThreshold: 1
-        timeoutSeconds: 1
+      livenessProbe:
+        failureThreshold: 3
+        exec:
+          command:
+            - /usr/sbin/netdatacli
+            - ping
+        periodSeconds: 30
+        successThreshold: 1
+        timeoutSeconds: 1
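As a quick sanity check of the command itself, it can be run inside an already-running child pod (pod name is a placeholder):

kubectl exec <netdata-child-pod> -- /usr/sbin/netdatacli ping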

@ilyam8
Member Author

ilyam8 commented Aug 12, 2024

@stelfrag can we use netdatacli ping/pong for liveness/readiness probes? Does a ping response mean Netdata is running and ready to serve requests?

@stelfrag

@stelfrag can we use netdatacli ping/pong for liveness/readiness probes? Does a ping response mean Netdata is running and ready to serve requests?

The command processing (CLI) is initialized last, so receiving a pong means the agent is ready to respond to requests at that point. Otherwise you get:

uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.

This will be the case during startup and while the metrics database is initializing (if that takes a while).

Requests (queries) to "children" may not be available even if you get a pong, as those continue to initialize asynchronously and populate contexts, etc.

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

Requests (queries) to "children" may not be available even if you get a pong, as those continue to initialize asynchronously and populate contexts, etc.

This is also true for an HTTP GET to /api/v1/info, right? So we can switch to netdatacli by default for both liveness and readiness probes.
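Something like this for both probes (a sketch; the thresholds mirror the current HTTP defaults and are illustrative, not final):

livenessProbe:
  exec:
    command:
      - /usr/sbin/netdatacli
      - ping
  failureThreshold: 3
  periodSeconds: 30
  successThreshold: 1
readinessProbe:
  exec:
    command:
      - /usr/sbin/netdatacli
      - ping
  failureThreshold: 3
  periodSeconds: 30
  successThreshold: 1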

@witalisoft
Contributor

@stelfrag

If netdatacli ping returns:

uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.

Does it also exit with a code other than 0?

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

Yes

root@pve-deb-work:/opt/netdata/usr/sbin# ./netdatacli ping; echo $?
pong
0

root@pve-deb-work:/opt/netdata/usr/sbin# systemctl stop netdata
root@pve-deb-work:/opt/netdata/usr/sbin# ./netdatacli ping; echo $?
uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.
255

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

@witalisoft One caveat is that timeoutSeconds becomes almost useless, because netdatacli ping returns immediately (while the HTTP request can hang for up to timeoutSeconds).

Startup time (worst case):

  • HTTP: periodSeconds * failureThreshold * timeoutSeconds.
  • CLI: periodSeconds * failureThreshold (up to timeoutSeconds times shorter when timeoutSeconds > 1).
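For example, with the defaults from the diff above (periodSeconds: 30, failureThreshold: 3) and timeoutSeconds: 10, this estimate gives 30 * 3 * 10 = 900s for HTTP versus 30 * 3 = 90s for the CLI probe; with the default timeoutSeconds: 1, the two come out the same.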

@witalisoft
Contributor

@ilyam8

One caveat is that timeoutSeconds becomes almost useless, because netdatacli ping returns immediately (while the HTTP request can hang for up to timeoutSeconds).

Startup time (worst case):

  • HTTP: periodSeconds * failureThreshold * timeoutSeconds.
  • CLI: periodSeconds * failureThreshold (up to timeoutSeconds times shorter when timeoutSeconds > 1).

So potentially the implication is that we will get liveness/readiness failures more often when Netdata is responding slowly?

@ilyam8
Member Author

ilyam8 commented Aug 19, 2024

@witalisoft

We're comparing the time it takes for Kubernetes to determine an application is unhealthy based on HTTP and EXEC types of liveness/readiness probes.

  • When using an HTTP liveness probe, the time it takes for Kubernetes to consider the application unhealthy is significantly influenced by the timeoutSeconds parameter. This is because the HTTP request can hang for the full timeoutSeconds duration, especially during application startup or under heavy load.
  • In contrast, when using an EXEC liveness probe with netdatacli ping, the timeoutSeconds parameter has a much smaller impact. This is because the netdatacli ping command returns almost immediately.

My point is that users who have previously relied on the timeoutSeconds value to estimate Netdata startup time might need to adjust their calculations.

@witalisoft
Contributor

@ilyam8

We're comparing the time it takes for Kubernetes to determine an application is unhealthy based on HTTP and EXEC types of liveness/readiness probes.

  • When using an HTTP liveness probe, the time it takes for Kubernetes to consider the application unhealthy is significantly influenced by the timeoutSeconds parameter. This is because the HTTP request can hang for the full timeoutSeconds duration, especially during application startup or under heavy load.
  • In contrast, when using an EXEC liveness probe with netdatacli ping, the timeoutSeconds parameter has a much smaller impact. This is because the netdatacli ping command returns almost immediately.

My point is that users who have previously relied on the timeoutSeconds value to estimate Netdata startup time might need to adjust their calculations.

Probably, but a bigger Netdata installation requires you to tune those values anyway.

In our use case, changing from HTTP to EXEC probes doesn't make any difference; the timeout is well below periodSeconds, similar to the default values.

livenessProbe:
  failureThreshold: 10
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 5
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 10

Exec liveness/readiness probes, time to Ready: 3m3s

conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:39:44Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:42:47Z"
  status: "True"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:42:47Z"
  status: "True"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:39:44Z"
  status: "True"
  type: PodScheduled

HTTP liveness/readiness probes, time to Ready: 3m2s

conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:50:36Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:53:38Z"
  status: "True"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:53:38Z"
  status: "True"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:50:36Z"
  status: "True"
  type: PodScheduled

@ilyam8
Member Author

ilyam8 commented Aug 19, 2024

Probably, but a bigger Netdata installation requires you to tune those values anyway.

Right. I meant that it might create problems for existing installations, but I don't see it as a blocker.

@ilyam8
Member Author

ilyam8 commented Aug 23, 2024

Closing in favour of #436

@ilyam8 ilyam8 closed this Aug 23, 2024
@ilyam8 ilyam8 deleted the child-use-localhost-for-probe branch August 23, 2024 15:32