
When left running for a while, flintlockd's grpc server becomes inaccessible: rpc error: code = Unavailable desc = failed to receive server preface within timeout #503

Closed
Callisto13 opened this issue Aug 17, 2022 · 8 comments
Labels
kind/bug Something isn't working

Comments

@Callisto13 (Member)

What happened:
I have seen this a couple of times now.

On infrastructure which has been left running for a little while (like 24hrs+), further requests to flintlock's service result in this error:

rpc error: code = Unavailable desc = failed to receive server preface within timeout

Restarting the server (in my case with `systemctl restart flintlockd.service` on Equinix) fixes the issue, and further requests are served just fine.

What did you expect to happen:
Flintlock's long-running gRPC service should not seize up over time.

How to reproduce it:
I have been testing with Equinix a lot these days, so for all I know it could be a problem with that specific setup/environment.

  • Use the LMAT env to create a host with flintlock running
  • Create some mvms
  • Leave it a day or so
  • Come back and try to list them
  • Observe the error
  • SSH onto the device
  • Run systemctl restart flintlockd.service
  • See that a list now works

Anything else you would like to add:
It should be verified whether this also happens in other environments.

Environment:

  • flintlock version: flintlock v0.1.1-3-g982d429
  • containerd version: containerd github.com/containerd/containerd v1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
  • OS (e.g. from /etc/os-release): Linux host-0 5.13.0-44-generic #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@Callisto13 (Member, Author) commented Aug 18, 2022

I think the “we don’t close client connections in capmvm” thing is the issue.

On the frozen host I had 1012 established connections to the flintlockd server. I tried a hammertime list and it hung, then returned the preface error. Back on the host I killed all the connections with `netstat -a | awk '/ESTABLISHED/ && /:9090/{print $5}' | cut -f 2 -d ":" | xargs -I {} ss -K dport {}`, then ran hammertime again and got a successful list.

@Callisto13 (Member, Author)

Interesting... "grpc run as systemd service with open file limits" (grpc/grpc-go#1261)

@Callisto13 (Member, Author)

And from @richardcase's parallel experiment:

> If we don't call close on the grpc connection then server-side goroutines keep climbing... 1 per connection until it blows up.
> I just added conn.Close() and server goroutines are stable.
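
For illustration, here is a minimal sketch of that fix on the client side. The helper function and address below are assumptions for the example, not the actual capmvm code; the point is the deferred `conn.Close()`.

```go
// Sketch only: dial flintlockd, use the connection, and always close it.
// Leaving out the Close leaks one connection (and a server-side goroutine)
// per call, which is the behaviour described above.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// listMicroVMs is a hypothetical helper, not a real capmvm function.
func listMicroVMs(addr string) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	// The fix: close the connection when we are done with it.
	defer conn.Close()

	// ... call the generated flintlock gRPC client against conn here ...
	return nil
}

func main() {
	if err := listMicroVMs("127.0.0.1:9090"); err != nil {
		log.Fatal(err)
	}
}
```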

@Callisto13 (Member, Author)

The fix is to actually close client connections in capmvm; it is being prepared now.

@richardcase (Member)

Without closing the client connection explicitly the number of server goroutines (and allocs) continues to rise until the server stops processing requests:

[screenshot: pprof goroutine and allocation counts climbing]

And if we explicitly close the connections, the goroutines and allocs stay stable:

[screenshot: pprof goroutine and allocation counts stable]

@Callisto13 (Member, Author) commented Aug 18, 2022

@richardcase in a previous team we put a /debug endpoint on our server, to return things like current goroutine count and other interesting bits. Worth adding that to flintlock?
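
For illustration, a minimal sketch of what such a /debug endpoint could look like (handler name, fields, and port are made up for the example, not flintlock code): it returns the current goroutine count plus a couple of memory stats as JSON.

```go
// Hypothetical /debug handler sketch, not flintlock's actual code.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"runtime"
)

type debugInfo struct {
	Goroutines int    `json:"goroutines"`
	HeapAlloc  uint64 `json:"heap_alloc_bytes"`
	NumGC      uint32 `json:"num_gc"`
}

func debugHandler(w http.ResponseWriter, r *http.Request) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(debugInfo{
		Goroutines: runtime.NumGoroutine(),
		HeapAlloc:  m.HeapAlloc,
		NumGC:      m.NumGC,
	})
}

func main() {
	http.HandleFunc("/debug", debugHandler)
	// Port is an arbitrary example.
	log.Fatal(http.ListenAndServe("localhost:8090", nil))
}
```

Hitting that endpoint between requests would show whether the goroutine count keeps climbing.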

@richardcase (Member)

@Callisto13 - yes for sure. The screenshots above come from the pprof endpoint I've added to flintlock. Are you thinking that we have our own handler, served from /debug, where we add our own interesting stuff? Sounds like a great idea to me.
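
For reference, the stdlib pprof handlers can be exposed with very little code; a sketch under the assumption of a separate localhost-only listener (flintlock's actual wiring may differ):

```go
// Sketch: expose the net/http/pprof endpoints on a side listener.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Keep this on localhost; the gRPC server itself listens elsewhere.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

`go tool pprof http://localhost:6060/debug/pprof/goroutine` (or `/debug/pprof/allocs`) then gives the kind of goroutine and allocation profiles shown in the screenshots above.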

@Callisto13 (Member, Author)

Either, or both.

Other metrics could be things like total mvm count, etc.
