
Fix signal handling #10412

Merged · 4 commits merged into master from fix-signal-handling on Oct 28, 2024

Conversation

@butonic (Member) commented Oct 24, 2024

signal handling seems broken ... or context cancellation is not always picked up properly ... needs more work

@butonic self-assigned this Oct 24, 2024

update-docs bot commented Oct 24, 2024

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

```diff
@@ -515,7 +514,7 @@ func trap(s *Service, ctx context.Context) {
 		}
 	}
 	s.Log.Debug().Str("service", "runtime service").Msgf("terminating with signal: %v", s)
-	os.Exit(0)
+	//os.Exit(0) // this seems to cause an early exit that prevents services from shitting down properly
```
Contributor:

This comment made my day 🤣

Member Author (@butonic):

it was late 🤣

@butonic (Member, Author) commented Oct 25, 2024

  1. Not only nats needs to wait 3 sec. Every service, when run in single-server mode (as in Kubernetes), needs to wait to let the service deregister itself. -> a global sleep instead of the exit call (see the sketch below).
  2. We need to find a way to make the registry 'forget' services. No matter what we do to shut down gracefully ... Kubernetes might decide to OOM-kill us before we can act ...
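
For illustration, a minimal sketch of the "global sleep instead of exit" idea as a plain signal trap; this is not the actual ocis runtime code, just the shape of the change:

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)

	<-sig // block until SIGINT or SIGTERM arrives

	// Instead of exiting immediately, give every embedded service a grace
	// period to deregister itself from the service registry.
	time.Sleep(3 * time.Second)
	os.Exit(0)
}
```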

@butonic marked this pull request as ready for review October 28, 2024 10:23
@butonic requested a review from kulmann as a code owner October 28, 2024 10:23
@butonic requested a review from fschade October 28, 2024 10:23
@butonic (Member, Author) commented Oct 28, 2024

@fschade This is far from what we need to do. But it will at least give all reva services 3 sec to deregister themselves.

However, there is still a problem with the natsjskv registry: we are using a TTL and nats will forget the service nodes. Unfortunately, it does not emit an event for the delete, causing the caches in our go micro clients to never forget the nodes.

Currently, nodes are only forgotten if the service properly deregisters itself (which this PR fixes). If a node is OOM-killed ... we don't forget it. I'll create a subsequent PR to put the TTL into the registry values and ignore them if they timed out.
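
A rough sketch of that follow-up idea, storing the TTL inside the registry values and ignoring expired entries on read; the types and field names here are assumptions, not the existing go-micro/natsjskv API:

```go
package registry

import "time"

// nodeRecord is a hypothetical value stored in the registry KV store.
type nodeRecord struct {
	Address   string
	ExpiresAt time.Time // set on (re-)registration: now + TTL
}

// aliveNodes drops records whose TTL has elapsed, so a node that was
// OOM-killed and never deregistered is eventually ignored by readers.
func aliveNodes(records []nodeRecord, now time.Time) []nodeRecord {
	alive := make([]nodeRecord, 0, len(records))
	for _, r := range records {
		if now.Before(r.ExpiresAt) {
			alive = append(alive, r)
		}
	}
	return alive
}
```

With something like this, lookups would filter expired nodes even though nats never emits a delete event for them.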

Signed-off-by (all 4 commits): Jörn Friedrich Dreyer <jfd@butonic.de>
@butonic force-pushed the fix-signal-handling branch from 48d9918 to 04632ad on October 28, 2024 13:47
```diff
@@ -515,7 +515,8 @@ func trap(s *Service, ctx context.Context) {
 		}
 	}
 	s.Log.Debug().Str("service", "runtime service").Msgf("terminating with signal: %v", s)
-	os.Exit(0)
+	time.Sleep(3 * time.Second) // give the services time to deregister
+	os.Exit(0) // FIXME this cause an early exit that prevents services from shitting down properly
```
Collaborator:

It is still here btw 😄 but we can keep it as a reminder.

Member Author (@butonic):

Yes, I was thinking of removing the comment, but it feels right as it is. reva has no way to properly handle a shutdown triggered by context cancellation. A runtime.Shutdown(ctx) function like the http server's would be great and would allow us to properly shut down all services ... currently we have no way of notifying decomposedfs to flush data to disk ... (see the sketch below).
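
For comparison, the net/http graceful-shutdown pattern the comment alludes to; a hypothetical runtime.Shutdown(ctx) could follow the same shape:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatal(err)
		}
	}()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	// Shutdown stops accepting new connections and waits for in-flight
	// requests to drain, bounded by the context deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```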

Quality Gate failed

Failed conditions
84.7% Duplication on New Code (required ≤ 50%)

See analysis details on SonarCloud

@butonic (Member, Author) commented Oct 28, 2024

Regarding the code duplication, I wonder if we can use generics to reduce the amount of boilerplate we have ... (a rough sketch below).
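
A rough sketch of what the generics idea could look like; Runner and start are hypothetical names, not existing ocis code:

```go
package main

import "fmt"

// Runner is a stand-in for whatever interface the services share.
type Runner interface {
	Run() error
}

// start collapses the repeated "construct, check error, run" block into one
// generic call site, regardless of the concrete service type.
func start[T Runner](name string, construct func() (T, error)) error {
	svc, err := construct()
	if err != nil {
		return fmt.Errorf("init %s: %w", name, err)
	}
	return svc.Run()
}
```

Each service would then be registered with a one-line start("name", NewService) call instead of its own copy of the boilerplate.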

@butonic mentioned this pull request Oct 28, 2024
@butonic merged commit a79e3b3 into master Oct 28, 2024 (2 of 3 checks passed)
@butonic deleted the fix-signal-handling branch October 28, 2024 14:24
ownclouders pushed a commit that referenced this pull request Oct 28, 2024