child: use 127.0.0.1 for httpGet probes if hostNetwork field is true #434

Closed
wants to merge 1 commit

Conversation

ilyam8
Member

@ilyam8 ilyam8 commented Aug 7, 2024

As suggested in https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes

Suppose the container listens on 127.0.0.1 and the Pod's hostNetwork field is true. Then host, under httpGet, should be set to 127.0.0.1.
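For reference, the rendered probe spec would then look roughly like this (a minimal sketch; path, port, and scheme mirror the chart's current httpGet probe, and host is the field this PR adds):

livenessProbe:
  httpGet:
    path: /api/v1/info
    port: http
    scheme: HTTP
    host: 127.0.0.1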

cc @BirknerAlex

@ilyam8 ilyam8 requested review from a team as code owners August 7, 2024 15:34
@ilyam8
Member Author

ilyam8 commented Aug 7, 2024

Hey, @BirknerAlex. Do we really need httpGet.host?

@BirknerAlex
Contributor

BirknerAlex commented Aug 7, 2024

In my case I kept hostNetwork: true because I thought it was needed to collect network interface stats or something like that (monitoring stuff).

I only bound the web server to 127.0.0.1 instead, so the Pod IP is still the node IP (public), which is used for the probes as well. So if you ask me, it's required to override it.
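For context, binding the Netdata web server to loopback is done in netdata.conf; a minimal sketch (the [web] section and "bind to" option, values illustrative):

[web]
    # listen on loopback only, so probes against the Pod/node IP will fail
    bind to = 127.0.0.1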

@ilyam8
Member Author

ilyam8 commented Aug 7, 2024

In my case I kept hostNetwork: true because I thought it was needed to collect network interface stats or something like that (monitoring stuff)

Yes. It is needed for:

  • Host system networking stack monitoring.
  • Monitoring applications running on the host (e.g. Kubelet).
  • Discovering all current network sockets and building a network-map (Top->Network connections).

I mean, if we just do the following:

{{- if .Values.child.hostNetwork }}
  host: 127.0.0.1
{{- end }}

That will do, right? Is there a case where hostNetwork is false and you still need to set host? I am thinking of removing the httpGet.host value entirely and just setting host to 127.0.0.1 when hostNetwork is true.
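For illustration, embedded in the probe template that might look like this (a sketch; the httpGet fields mirror the chart's current defaults, and .Values.child.hostNetwork is the value from the snippet above):

livenessProbe:
  httpGet:
    path: /api/v1/info
    port: http
    scheme: HTTP
    {{- if .Values.child.hostNetwork }}
    host: 127.0.0.1
    {{- end }}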

@BirknerAlex
Contributor

BirknerAlex commented Aug 7, 2024

Yes, you're right. That should be fine instead of the additional value.

Edit: Okay, I might have found one reason to keep it, though not in my use case: with hostNetwork: true and the web server bound to a specific additional IP of the host (e.g. if the node has multiple IPs, one internal and one external), you might want to set that IP instead. But sure, you could also additionally bind it to 127.0.0.1 as well. That would only work on a single-node cluster, though.

@witalisoft
Contributor

witalisoft commented Aug 12, 2024

What about changing the readiness/liveness probes to the exec type and simplifying it?

-      livenessProbe:
-        failureThreshold: 3
-        httpGet:
-          path: /api/v1/info
-          port: http
-          scheme: HTTP
-        periodSeconds: 30
-        successThreshold: 1
-        timeoutSeconds: 1
+      livenessProbe:
+        failureThreshold: 3
+        exec:
+          command:
+            - /usr/sbin/netdatacli
+            - ping
+        periodSeconds: 30
+        successThreshold: 1
+        timeoutSeconds: 1
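As a quick sanity check of the command itself, it can be run inside an already-running child pod (pod name is a placeholder):

kubectl exec <netdata-child-pod> -- /usr/sbin/netdatacli ping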

@ilyam8
Member Author

ilyam8 commented Aug 12, 2024

@stelfrag can we use netdatacli ping/pong for liveness/readiness probes? Does a ping response mean Netdata is running and ready to serve requests?

@stelfrag

@stelfrag can we use netdatacli ping/pong for liveness/readiness probes? Does a ping response mean Netdata is running and ready to serve requests?

The command processing (CLI) is initialized last, so receiving a pong means the agent is ready to respond to requests at that point. Otherwise you get:

uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.

This will be the case during startup and while the metrics database is initializing (if that takes a while).

Requests (queries) to "children" may not be available even if you get a pong, as those continue to initialize asynchronously and populate contexts, etc.

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

Requests (queries) to "children" may not be available even if you get a pong, as those continue to initialize asynchronously and populate contexts, etc.

This is also true for an HTTP GET to /api/v1/info, right? So we can switch to netdatacli by default for both liveness and readiness probes.
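Something like this for both probes (a sketch; the thresholds mirror the current HTTP defaults and are illustrative, not final):

livenessProbe:
  exec:
    command:
      - /usr/sbin/netdatacli
      - ping
  failureThreshold: 3
  periodSeconds: 30
  successThreshold: 1
readinessProbe:
  exec:
    command:
      - /usr/sbin/netdatacli
      - ping
  failureThreshold: 3
  periodSeconds: 30
  successThreshold: 1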

@witalisoft
Contributor

@stelfrag

If netdatacli ping returns:

uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.

Does it also exit with a code other than 0?

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

Yes

root@pve-deb-work:/opt/netdata/usr/sbin# ./netdatacli ping; echo $?
pong
0

root@pve-deb-work:/opt/netdata/usr/sbin# systemctl stop netdata
root@pve-deb-work:/opt/netdata/usr/sbin# ./netdatacli ping; echo $?
uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.
255

@ilyam8
Member Author

ilyam8 commented Aug 14, 2024

@witalisoft One caveat is that timeoutSeconds becomes almost useless, because netdatacli ping returns immediately (while the HTTP request can hang for up to timeoutSeconds).

Startup time (worst case):

  • HTTP: periodSeconds * failureThreshold * timeoutSeconds.
  • CLI: periodSeconds * failureThreshold (up to timeoutSeconds times shorter when timeoutSeconds > 1).
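For example, with the defaults from the diff above (periodSeconds: 30, failureThreshold: 3) and timeoutSeconds: 10, this estimate gives 30 * 3 * 10 = 900s for HTTP versus 30 * 3 = 90s for the CLI probe; with the default timeoutSeconds: 1, the two come out the same.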

@witalisoft
Contributor

@ilyam8

One caveat is that timeoutSeconds becomes almost useless, because netdatacli ping returns immediately (while the HTTP request can hang for up to timeoutSeconds).

Startup time (worst case):

  • HTTP: periodSeconds * failureThreshold * timeoutSeconds.
  • CLI: periodSeconds * failureThreshold (up to timeoutSeconds times shorter when timeoutSeconds > 1).

So potentially the implication is that we will get liveness/readiness failures more often when Netdata is responding slowly?

@ilyam8
Member Author

ilyam8 commented Aug 19, 2024

@witalisoft

We're comparing the time it takes for Kubernetes to determine an application is unhealthy based on HTTP and EXEC types of liveness/readiness probes.

  • When using an HTTP liveness probe, the time it takes for Kubernetes to consider the application unhealthy is significantly influenced by the timeoutSeconds parameter. This is because the HTTP request can hang for the full timeoutSeconds duration, especially during application startup or under heavy load.
  • In contrast, when using an EXEC liveness probe with netdatacli ping, the timeoutSeconds parameter has a much smaller impact. This is because the netdatacli ping command returns almost immediately.

My point is that users who have previously relied on the timeoutSeconds value to estimate Netdata startup time might need to adjust their calculations.

@witalisoft
Contributor

@ilyam8

We're comparing the time it takes for Kubernetes to determine an application is unhealthy based on HTTP and EXEC types of liveness/readiness probes.

  • When using an HTTP liveness probe, the time it takes for Kubernetes to consider the application unhealthy is significantly influenced by the timeoutSeconds parameter. This is because the HTTP request can hang for the full timeoutSeconds duration, especially during application startup or under heavy load.
  • In contrast, when using an EXEC liveness probe with netdatacli ping, the timeoutSeconds parameter has a much smaller impact. This is because the netdatacli ping command returns almost immediately.

My point is that users who have previously relied on the timeoutSeconds value to estimate Netdata startup time might need to adjust their calculations.

Probably, but a bigger Netdata installation requires you to tune those values anyway.

In our use case, changing from HTTP to EXEC probes doesn't make any difference; the timeout is well below periodSeconds, similar to the default values.

livenessProbe:
  failureThreshold: 10
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  failureThreshold: 5
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 10

Exec liveness/readiness probes, time to Ready: 3m3s

conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:39:44Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:42:47Z"
  status: "True"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:42:47Z"
  status: "True"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:39:44Z"
  status: "True"
  type: PodScheduled

HTTP liveness/readiness probes, time to Ready: 3m2s

conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:50:36Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:53:38Z"
  status: "True"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:53:38Z"
  status: "True"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-08-19T09:50:36Z"
  status: "True"
  type: PodScheduled

@ilyam8
Member Author

ilyam8 commented Aug 19, 2024

Probably, but a bigger Netdata installation requires you to tune those values anyway.

Right. I meant that it might create problems for existing installations, but I don't see it as a blocker.

@ilyam8
Member Author

ilyam8 commented Aug 23, 2024

Closing in favour of #436

@ilyam8 ilyam8 closed this Aug 23, 2024
@ilyam8 ilyam8 deleted the child-use-localhost-for-probe branch August 23, 2024 15:32