
Pepr Watch is not Responding to Changes after 90 mins #745

Closed
cmwylie19 opened this issue Apr 16, 2024 · 10 comments
Comments

cmwylie19 (Contributor) commented Apr 16, 2024

Environment

Device and OS: UDS Core
App version:

  • UDS Core v0.19.0 (Pepr v0.29.0) Reported by Jordan
  • UDS Core v0.18.0 (Pepr v0.28.6) Reported by Wayne

The Watch controller only started responding again after the pod was rolled. The only entries in the pod logs were health checks.

Kubernetes distro being used:
Other: EKS Kubernetes 1.29

Steps to reproduce

  1. Working on reproducing with a soak test

Expected result

Upon disconnection, the watch reconnects

Actual Result

Visual Proof (screenshots, videos, text, etc)

Severity/Priority

Additional Context


Well-known upstream bug reports (see the sketch after these links for the failure mode):
kubernetes-client/csharp#533
kubernetes-client/javascript#596
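
For context, the linked reports describe a watch whose underlying connection dies without the client's done/error callback ever firing, so no events arrive and nothing triggers a reconnect. Below is a minimal sketch of that failure mode plus a "last seen" heartbeat guard, using @kubernetes/client-node; the resource path, namespace, and timing values are illustrative assumptions, not Pepr's actual configuration:

```typescript
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const watch = new k8s.Watch(kc);
let lastSeen = Date.now();

async function startWatch(): Promise<void> {
  // Illustrative resource path and namespace, not Pepr's actual watch target.
  await watch.watch(
    "/api/v1/namespaces/default/pods",
    {},
    (phase, obj) => {
      lastSeen = Date.now(); // any event proves the stream is still alive
      console.log(`event: ${phase}`);
    },
    (err) => {
      // The linked bugs: when the TCP connection dies silently, this "done"
      // callback may never fire, so restarting only on "done" is not enough.
      console.error("watch ended", err);
      void startWatch();
    },
  );
}

// Heartbeat guard: if nothing has been seen for too long, assume the stream
// is dead and re-establish the watch (a real implementation would also abort
// the stale request). This mirrors the PEPR_LAST_SEEN_LIMIT_SECONDS idea
// mentioned later in the thread; 90s here is an arbitrary example value.
setInterval(() => {
  if (Date.now() - lastSeen > 90_000) {
    lastSeen = Date.now();
    void startWatch();
  }
}, 10_000);

void startWatch();
```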

@cmwylie19 cmwylie19 moved this from 🆕 New to 🏗 In progress in Pepr Project Board Apr 16, 2024
@cmwylie19 cmwylie19 self-assigned this Apr 16, 2024
cmwylie19 (Contributor, Author) commented:

Here is a temporary workaround that should force the watcher pod to reconnect: https://gist.github.com/cmwylie19/2c07e0e6f0962b8f18999488d1646a4a

cmwylie19 (Contributor, Author) commented Apr 19, 2024

The AWS VPC CNI may have issues and is sometimes dropping policy; that could have something to do with this.

Related Issue - aws/amazon-vpc-cni-k8s#2103

cmwylie19 (Contributor, Author) commented:

Fixed by #766. If you experience this, set the environment variable PEPR_RESYNCINTERVALSECONDS to something very low, like 23 seconds. We soaked a module that was affected for 19 hours, twice, and did not see the problem.
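
For anyone reading along, the resync interval amounts to a timer that periodically tears down and re-establishes the watch regardless of whether events are flowing, so a silently dead connection cannot outlive the interval. A rough sketch of the idea follows; this is not Pepr's actual implementation, the resource path is a placeholder, and the 23-second default only mirrors the suggestion above:

```typescript
import * as k8s from "@kubernetes/client-node";

// 23s mirrors the suggestion above; in practice this should be tuned carefully.
const RESYNC_SECONDS = Number(process.env.PEPR_RESYNCINTERVALSECONDS ?? "23");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

// Older and newer client versions return different objects from watch(),
// but both expose abort().
let current: { abort: () => void } | undefined;

async function startWatch(): Promise<void> {
  current = await watch.watch(
    "/api/v1/pods", // placeholder path
    {},
    (phase) => console.log(`event: ${phase}`),
    (err) => console.error("watch ended", err),
  );
}

// Periodic resync: tear down the current watch and start a fresh one, so a
// silently dead connection can never linger longer than the interval.
setInterval(() => {
  current?.abort();
  void startWatch();
}, RESYNC_SECONDS * 1000);

void startWatch();
```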

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Pepr Project Board May 1, 2024
jeff-mccoy (Member) commented:

This is not technically resolved, and setting a very low resync threshold like this is a really bad practice in production. Agree we need to either look at dropping Istio for this or look at MESH_EXTERNAL.

cmwylie19 (Contributor, Author) commented:

This should be solved based on c0d3aaa and the KFC (kubernetes-fluent-client) informer pattern. All soak tests have passed.
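
For readers unfamiliar with the term, the informer pattern pairs an initial list with a watch that resumes from the returned resourceVersion and relists on error, which is why it tolerates dropped connections better than a bare watch. A generic sketch of the pattern using @kubernetes/client-node's makeInformer is below; this is not the actual kubernetes-fluent-client code from c0d3aaa, and the resource, namespace, and retry delay are assumptions:

```typescript
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreV1 = kc.makeApiClient(k8s.CoreV1Api);

// List function the informer uses to prime its cache and obtain a
// resourceVersion to resume watching from (exact signature varies by
// client-node version).
const listPods = () => coreV1.listNamespacedPod("default");

const informer = k8s.makeInformer(kc, "/api/v1/namespaces/default/pods", listPods);

informer.on("add", (pod: k8s.V1Pod) => console.log(`added: ${pod.metadata?.name}`));
informer.on("update", (pod: k8s.V1Pod) => console.log(`updated: ${pod.metadata?.name}`));
informer.on("delete", (pod: k8s.V1Pod) => console.log(`deleted: ${pod.metadata?.name}`));

// The key difference from a bare watch: failures surface as an error event,
// and the informer can simply be restarted, relisting and resuming the watch.
informer.on("error", (err) => {
  console.error("informer error, restarting in 5s", err);
  setTimeout(() => void informer.start(), 5000);
});

void informer.start();
```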

joelmccoy commented:

Ran into this the other day using uds-core 0.26.1 (pepr v0.36.0) in our internal environments with PEPR_LAST_SEEN_LIMIT_SECONDS set to 60. Didn't have the chance to catch the watch metrics, but will capture those if it occurs again.

robsulllly commented:

Just ran through a UDS Core and bundling demo with Raytheon and hit the problem: I had to restart the pepr-uds-core-watcher pod in order to get a successful deployment of mattermost, postgres operator, minio, etc. It looks like pepr blocks some of the secrets from being created, which then prevents mattermost from coming up. Killing the pepr watcher pod and then re-doing the deployment fixes the problem.

This can happen regardless of whether the cluster/platform has been up for hours or just came up a couple minutes ago. Does not happen every time though.

UDS Core 0.27.2 - k3d-core-demo:0.27.2 (also existed on 0.26.1)
ARM64
ghcr.io/defenseunicorns/packages/uds/dev-minio:0.0.2
ghcr.io/defenseunicorns/packages/uds/postgres-operator:1.12.2-uds.2-upstream
ghcr.io/defenseunicorns/packages/uds/mattermost:9.10.1-uds.0-upstream
Unfortunately, I was in the middle of a demo and had to press forward, kill the pepr watcher on the side, and re-deploy, so I don't have the previous logs.

JoeHCQ1 commented Sep 20, 2024

Just got it today; it happened while deploying core into a cluster, and by the time I got to Neuvector, Pepr was AWOL. Had to kill the watcher and bounce all affected Neuvector pods.

Core version 0.27.2.

cmwylie19 (Contributor, Author) commented Nov 20, 2024

Fixed in defenseunicorns/kubernetes-fluent-client#399; we have not heard any more complaints. Feel free to open it back up if it occurs again.
