
Pepr Watch is not Responding to Changes after 90 mins #745

Closed
cmwylie19 opened this issue Apr 16, 2024 · 10 comments
Comments

cmwylie19 (Contributor) commented Apr 16, 2024

Environment

Device and OS: UDS Core
App version:

  • UDS Core v0.19.0 (Pepr v0.29.0) Reported by Jordan
  • UDS Core v0.18.0 (Pepr v0.28.6) Reported by Wayne

The Watch controller only started responding again after the pod was rolled. The only entries in the pod logs were health checks.

Kubernetes distro being used:
Other: EKS Kubernetes 1.29

Steps to reproduce

  1. Working on reproducing with a soak test

Expected result

Upon disconnection, the watch reconnects

Actual Result

Visual Proof (screenshots, videos, text, etc)

Severity/Priority

Additional Context


Well-known upstream bug reports (see the sketch after these links for the failure mode):
kubernetes-client/csharp#533
kubernetes-client/javascript#596
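
For context, the linked reports describe a watch whose underlying connection dies without the client's done/error callback ever firing, so no events arrive and nothing triggers a reconnect. Below is a minimal sketch of that failure mode plus a "last seen" heartbeat guard, using @kubernetes/client-node; the resource path, namespace, and timing values are illustrative assumptions, not Pepr's actual configuration:

```typescript
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const watch = new k8s.Watch(kc);
let lastSeen = Date.now();

async function startWatch(): Promise<void> {
  // Illustrative resource path and namespace, not Pepr's actual watch target.
  await watch.watch(
    "/api/v1/namespaces/default/pods",
    {},
    (phase, obj) => {
      lastSeen = Date.now(); // any event proves the stream is still alive
      console.log(`event: ${phase}`);
    },
    (err) => {
      // The linked bugs: when the TCP connection dies silently, this "done"
      // callback may never fire, so restarting only on "done" is not enough.
      console.error("watch ended", err);
      void startWatch();
    },
  );
}

// Heartbeat guard: if nothing has been seen for too long, assume the stream
// is dead and re-establish the watch (a real implementation would also abort
// the stale request). This mirrors the PEPR_LAST_SEEN_LIMIT_SECONDS idea
// mentioned later in the thread; 90s here is an arbitrary example value.
setInterval(() => {
  if (Date.now() - lastSeen > 90_000) {
    lastSeen = Date.now();
    void startWatch();
  }
}, 10_000);

void startWatch();
```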

@cmwylie19 cmwylie19 moved this from 🆕 New to 🏗 In progress in Pepr Project Board Apr 16, 2024
@cmwylie19 cmwylie19 self-assigned this Apr 16, 2024
cmwylie19 (Contributor, Author) commented:

Here is a temporary workaround that should force the watcher pod to reconnect: https://gist.github.com/cmwylie19/2c07e0e6f0962b8f18999488d1646a4a

cmwylie19 (Contributor, Author) commented Apr 19, 2024

The AWS VPC CNI may have issues and is sometimes dropping policy; that could have something to do with this.

Related Issue - aws/amazon-vpc-cni-k8s#2103

cmwylie19 (Contributor, Author) commented:

Fixed by #766. If you experience this, set the environment variable PEPR_RESYNCINTERVALSECONDS to something very low, like 23 seconds. We soaked a module that was affected for 19 hours, twice, and did not see the problem.
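
For anyone reading along, the resync interval amounts to a timer that periodically tears down and re-establishes the watch regardless of whether events are flowing, so a silently dead connection cannot outlive the interval. A rough sketch of the idea follows; this is not Pepr's actual implementation, the resource path is a placeholder, and the 23-second default only mirrors the suggestion above:

```typescript
import * as k8s from "@kubernetes/client-node";

// 23s mirrors the suggestion above; in practice this should be tuned carefully.
const RESYNC_SECONDS = Number(process.env.PEPR_RESYNCINTERVALSECONDS ?? "23");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

// Older and newer client versions return different objects from watch(),
// but both expose abort().
let current: { abort: () => void } | undefined;

async function startWatch(): Promise<void> {
  current = await watch.watch(
    "/api/v1/pods", // placeholder path
    {},
    (phase) => console.log(`event: ${phase}`),
    (err) => console.error("watch ended", err),
  );
}

// Periodic resync: tear down the current watch and start a fresh one, so a
// silently dead connection can never linger longer than the interval.
setInterval(() => {
  current?.abort();
  void startWatch();
}, RESYNC_SECONDS * 1000);

void startWatch();
```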

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Pepr Project Board May 1, 2024
jeff-mccoy (Member) commented:

This is not technically resolved, and setting a very low resync threshold like this is a really bad practice in production. Agree we need to either look at dropping Istio for this or look at MESH_EXTERNAL.

cmwylie19 (Contributor, Author) commented:

This should be solved based on c0d3aaa and the KFC (kubernetes-fluent-client) informer pattern. All soak tests have passed.
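
For readers unfamiliar with the term, the informer pattern pairs an initial list with a watch that resumes from the returned resourceVersion and relists on error, which is why it tolerates dropped connections better than a bare watch. A generic sketch of the pattern using @kubernetes/client-node's makeInformer is below; this is not the actual kubernetes-fluent-client code from c0d3aaa, and the resource, namespace, and retry delay are assumptions:

```typescript
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreV1 = kc.makeApiClient(k8s.CoreV1Api);

// List function the informer uses to prime its cache and obtain a
// resourceVersion to resume watching from (exact signature varies by
// client-node version).
const listPods = () => coreV1.listNamespacedPod("default");

const informer = k8s.makeInformer(kc, "/api/v1/namespaces/default/pods", listPods);

informer.on("add", (pod: k8s.V1Pod) => console.log(`added: ${pod.metadata?.name}`));
informer.on("update", (pod: k8s.V1Pod) => console.log(`updated: ${pod.metadata?.name}`));
informer.on("delete", (pod: k8s.V1Pod) => console.log(`deleted: ${pod.metadata?.name}`));

// The key difference from a bare watch: failures surface as an error event,
// and the informer can simply be restarted, relisting and resuming the watch.
informer.on("error", (err) => {
  console.error("informer error, restarting in 5s", err);
  setTimeout(() => void informer.start(), 5000);
});

void informer.start();
```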

joelmccoy commented:

Ran into this the other day using uds-core 0.26.1 (pepr v0.36.0) in our internal environments with PEPR_LAST_SEEN_LIMIT_SECONDS set to 60. Didn't have the chance to catch the watch metrics, but will capture those if it occurs again.

robsulllly commented:

Just ran through a UDS Core and bundling demo with Raytheon and hit the problem: I had to restart the pepr-uds-core-watcher pod in order to get a successful deployment of mattermost, postgres operator, minio, etc. It looks like pepr blocks some of the secrets from being created, which then prevents mattermost from coming up. Killing the pepr watcher pod and then re-doing the deployment fixes the problem.

This can happen regardless of whether the cluster/platform has been up for hours or just came up a couple minutes ago. Does not happen every time though.

UDS Core 0.27.2 - k3d-core-demo:0.27.2 (also existed on 0.26.1)
ARM64
ghcr.io/defenseunicorns/packages/uds/dev-minio:0.0.2
ghcr.io/defenseunicorns/packages/uds/postgres-operator:1.12.2-uds.2-upstream
ghcr.io/defenseunicorns/packages/uds/mattermost:9.10.1-uds.0-upstream
Unfortunately, I was in the middle of a demo and had to press forward, kill the pepr watcher on the side, and re-deploy, so I don't have the previous logs.

JoeHCQ1 commented Sep 20, 2024

Just got it today; it happened while deploying core into a cluster, and by the time I got to Neuvector, Pepr was AWOL. Had to kill the watcher and bounce all affected Neuvector pods.

Core version 0.27.2.

cmwylie19 (Contributor, Author) commented Nov 20, 2024

Fixed in defenseunicorns/kubernetes-fluent-client#399; we have not heard any more complaints. Feel free to open it back up if it occurs again.
