Pepr Watch is not Responding to Changes after 90 mins #745

Closed
cmwylie19 opened this issue Apr 16, 2024 · 6 comments

@cmwylie19
Collaborator

cmwylie19 commented Apr 16, 2024

Environment

Device and OS: UDS Core
App version:

  • UDS Core v0.19.0 (Pepr v0.29.0), reported by Jordan
  • UDS Core v0.18.0 (Pepr v0.28.6), reported by Wayne

The watch controller started responding again after it was rolled. The only pod logs were health checks.

Kubernetes distro being used:
Other: EKS Kubernetes 1.29

Steps to reproduce

  1. Still working on reproducing this in a soak test.

Expected result

Upon disconnection, the watch reconnects.

Actual Result

Visual Proof (screenshots, videos, text, etc)

Severity/Priority

Additional Context

Well-known bug:
kubernetes-client/csharp#533
kubernetes-client/javascript#596

cmwylie19 self-assigned this Apr 16, 2024
@cmwylie19
Collaborator Author

Here is a temporary workaround that should force the watcher pod to reconnect: https://gist.github.com/cmwylie19/2c07e0e6f0962b8f18999488d1646a4a

@cmwylie19
Collaborator Author

cmwylie19 commented Apr 19, 2024

The AWS VPC CNI may have issues and sometimes drops policy, which could be related to this problem.

Related Issue - aws/amazon-vpc-cni-k8s#2103

@cmwylie19
Collaborator Author

Fixed by #766. If you experience this, set the environment variable PEPR_RESYNCINTERVALSECONDS to something very low, like 23 seconds. We twice soaked an affected module for 19 hours and did not see the problem.
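
For anyone wondering what a resync interval does in principle, here is a rough sketch of the idea, not Pepr's actual implementation. Only the variable name PEPR_RESYNCINTERVALSECONDS comes from this thread. The helper names and the 300 second fallback are assumptions made up for the example.

```typescript
// Sketch only: helper names and the fallback value are hypothetical,
// not taken from Pepr's source.
const ASSUMED_FALLBACK_SECONDS = 300;

// Read the resync interval from an environment-like object.
// Missing, non-numeric, or non-positive values fall back to the assumed default.
function parseResyncIntervalSeconds(
  env: Record<string, string | undefined>,
): number {
  const raw = env["PEPR_RESYNCINTERVALSECONDS"];
  const parsed = raw === undefined ? NaN : parseInt(raw, 10);
  return parsed > 0 ? parsed : ASSUMED_FALLBACK_SECONDS;
}

// A watch is treated as stale when no event has arrived for longer than
// the interval. A stale watch would be torn down and re-established.
function isWatchStale(
  lastEventMs: number,
  nowMs: number,
  intervalSeconds: number,
): boolean {
  return nowMs - lastEventMs > intervalSeconds * 1000;
}
```

In Node you would pass `process.env` to the parser and run the staleness check on a timer. A very low interval makes the check fire often, which is why it hides a silent disconnect but also adds reconnect load.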

@jeff-mccoy
Member

This is not technically resolved, and setting a very low resync threshold like this is a really bad practice in production. Agreed that we need to look at either dropping Istio for this or using MESH_EXTERNAL.

@cmwylie19
Collaborator Author

This should be solved by c0d3aaa and the KFC informer pattern. All soak tests have passed.
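
As a rough illustration of why an informer-style approach helps, here is a minimal sketch of the relist half of the pattern. It is not the code from c0d3aaa or KFC. The types and the `reconcile` function are hypothetical names for this example. The idea is that a periodic list is compared against a local cache, so any event the watch silently missed still gets surfaced.

```typescript
// Sketch only: types and function names are hypothetical.
interface ListedObject {
  uid: string;
  resourceVersion: string;
}

interface MissedEvent {
  type: "ADDED" | "MODIFIED" | "DELETED";
  uid: string;
}

// Compare a fresh list against the cache (uid -> resourceVersion),
// return the events the watch must have missed, and update the cache.
function reconcile(
  cache: Map<string, string>,
  listed: ListedObject[],
): MissedEvent[] {
  const events: MissedEvent[] = [];
  const seen = new Set<string>();

  for (const obj of listed) {
    seen.add(obj.uid);
    const cachedVersion = cache.get(obj.uid);
    if (cachedVersion === undefined) {
      events.push({ type: "ADDED", uid: obj.uid });
    } else if (cachedVersion !== obj.resourceVersion) {
      events.push({ type: "MODIFIED", uid: obj.uid });
    }
    cache.set(obj.uid, obj.resourceVersion);
  }

  // Anything cached but absent from the list was deleted while unobserved.
  for (const uid of Array.from(cache.keys())) {
    if (!seen.has(uid)) {
      events.push({ type: "DELETED", uid });
      cache.delete(uid);
    }
  }
  return events;
}
```

With this shape, a dropped connection delays events until the next relist instead of losing them, so the relist interval does not need to be aggressively low.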
