
There is no response after grpc runs for a period of time #6858

Closed
itgcl opened this issue Dec 15, 2023 · 15 comments

@itgcl

itgcl commented Dec 15, 2023

What version of gRPC are you using?

v1.59.0

What version of Go are you using (go version)?

v1.19

What operating system (Linux, Windows, …) and version?

linux and mac

What did you do?

I initialize and connect the gRPC client, and calls respond normally at first. After the service has been running for about two hours, calling a method again gets no response until the call times out.
Through debugging, I found that the request waits in the method below until the timeout fires;
cc.firstResolveEvent.HasFired() returns false.

func (cc *ClientConn) waitForResolvedAddrs(ctx context.Context) error {
	// This is on the RPC path, so we use a fast path to avoid the
	// more-expensive "select" below after the resolver has returned once.
	if cc.firstResolveEvent.HasFired() {
		return nil
	}
	select {
	case <-cc.firstResolveEvent.Done():
		return nil
	case <-ctx.Done():
		return status.FromContextError(ctx.Err()).Err()
	case <-cc.ctx.Done():
		return ErrClientConnClosing
	}
}
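
For illustration (not part of the original report), a minimal sketch of how the hang surfaces to a caller. It uses a hypothetical target and the standard gRPC health service purely as an example RPC; since waitForResolvedAddrs only returns early when the RPC context is done, attaching a deadline turns the silent stall into a DeadlineExceeded error instead of an indefinite hang.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

func main() {
	// "example.internal:50051" is a hypothetical target, not from the issue.
	conn, err := grpc.Dial("dns:///example.internal:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// waitForResolvedAddrs can only exit early via ctx.Done(), so give the
	// RPC an explicit deadline rather than letting it hang.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	_, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if status.Code(err) == codes.DeadlineExceeded {
		// Likely still blocked waiting for cc.firstResolveEvent to fire.
		log.Printf("RPC timed out before the resolver reported addresses: %v", err)
	}
}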

@atollena
Collaborator

Could this be a dup of #6783?

My company hasn't been using gRPC 1.59.0 because of that bug.

https://github.com/grpc/grpc-go/releases doesn't list 1.59.0, although the tag https://github.com/grpc/grpc-go/releases/tag/v1.59.0 still exists. Perhaps it's an attempt to "unrelease" it? Is it because of this problem?

1.60.0 solves this issue but it looks like it introduces some other problems (#6854).

We've been running 1.58 and it's working great.

@easwars
Contributor

easwars commented Dec 15, 2023

grpc/grpc-go/releases doesn't list 1.59.0, although the tag v1.59.0 (release) still exists. Perhaps it's an attempt to "unrelease" it? Is it because of this problem?

That is weird. We haven't made any attempt so far to unrelease 1.59.

1.60.0 solves this issue but it looks like it introduces some other problems (#6854).

We did have a deadlock that could happen in 1.59 if the channel received an update from the resolver around the same time that it was trying to go idle. And that has been fixed now.

And we pushed a fix to #6854 as well. We haven't done a patch release so far.

@dfawley
Member

dfawley commented Dec 15, 2023

My company hasn't been using gRPC 1.59.0 because of that bug.

Disabling idleness would be a workaround for using 1.59.0 and avoiding that bug.
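
For example, a minimal sketch (not code from this thread) of that workaround at dial time, disabling channel idleness by setting the idle timeout to zero:

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "example.internal:50051" is a hypothetical target.
	conn, err := grpc.Dial("dns:///example.internal:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithIdleTimeout(0), // a zero timeout disables channel idleness
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}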

grpc/grpc-go/releases doesn't list 1.59.0, although the tag v1.59.0 (release) still exists. Perhaps it's an attempt to "unrelease" it? Is it because of this problem?

Well that's strange. For me it shows up at the top of the list, even higher on the page than 1.60, which I noticed earlier and thought was also strange.

1.60.0 solves this issue but it looks like it introduces some other problems (#6854).

FWIW, this issue can only happen if you are using grpc.NumStreamWorkers, which is uncommon (though perhaps you are?).
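
For reference, a sketch of the (uncommon) server setup being referred to: #6854 only applies when the server is built with grpc.NumStreamWorkers, which replaces the default one-goroutine-per-stream model with a fixed pool of stream worker goroutines. The listen address here is hypothetical.

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051") // hypothetical listen address
	if err != nil {
		log.Fatalf("listen failed: %v", err)
	}

	// With NumStreamWorkers set, incoming streams are handled by a fixed pool
	// of worker goroutines instead of one new goroutine per stream.
	s := grpc.NewServer(grpc.NumStreamWorkers(8))

	// Register services here before serving.
	if err := s.Serve(lis); err != nil {
		log.Fatalf("serve failed: %v", err)
	}
}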

@easwars
Contributor

easwars commented Dec 15, 2023

@itgcl : As mentioned in the previous comments, we expect the issue to be resolved by upgrading to v1.60.0. Please let us know if that helps.

github-actions bot commented Dec 21, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Dec 21, 2023
@itgcl
Author

itgcl commented Dec 27, 2023

@easwars : I tried upgrading to v1.60.0, but this bug still exists. For now I have downgraded to v1.58 and it runs normally.

@itgcl
Author

itgcl commented Dec 27, 2023

Could this be a dup of #6783?

My company hasn't been using gRPC 1.59.0 because of that bug.

https://github.com/grpc/grpc-go/releases doesn't list 1.59.0, although the tag https://github.com/grpc/grpc-go/releases/tag/v1.59.0 still exists. Perhaps it's an attempt to "unrelease" it? Is it because of this problem?

1.60.0 solves this issue but it looks like it introduces some other problems (#6854).

We've been running 1.58 and it's working great.

@atollena: I tried upgrading to v1.60.0, but this bug still exists; I have downgraded to v1.58.

@github-actions github-actions bot removed the stale label Dec 27, 2023
@whs

whs commented Dec 27, 2023

+1, we upgraded to 1.60.1 and the bug persists. Rolling back to 1.58 as well

@easwars
Contributor

easwars commented Dec 27, 2023

@itgcl @whs

  • Did you try disabling channel idleness? https://pkg.go.dev/google.golang.org/grpc#WithIdleTimeout
  • Do you have a way to reproduce the issue?
  • And does it always take two hours for the issue to happen?
  • Are you making unary or streaming RPCs?

@whs

whs commented Dec 27, 2023

In our case we use xDS. After about a day, xDS reports that name resolution returns nothing, probably because the xDS client loses its connection to the xDS management server.

It is pretty reproducible in our test environment, but it's a real application; we don't have a minimal reproduction case yet. It takes a few hours before the issue kicks in, though.

One of our teams does not experience the issue (yet), so we suspect the issue may also depend on usage volume.

@easwars
Contributor

easwars commented Dec 27, 2023

IIUC, you are seeing the issue in the channel between the xDS client inside gRPC and the xDS management server, and not between your application's client and server?

We do have detailed logs for the xDS client that you can turn on by setting the following env vars: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info. Would you be able to give that a go in your test environment? Thanks.
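
As an aside (not part of the original exchange), roughly equivalent verbosity can also be enabled programmatically via grpclog when changing the environment is inconvenient; the logger must be set before any other gRPC activity, hence the init function in this sketch.

package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Roughly equivalent to GRPC_GO_LOG_SEVERITY_LEVEL=info and
	// GRPC_GO_LOG_VERBOSITY_LEVEL=99: info/warning/error to stderr,
	// verbosity level 99.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 99))
}

func main() {
	// Create the xDS-enabled client and run the workload as usual; the xDS
	// client's detailed logs will now be emitted.
}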

@itgcl
Author

itgcl commented Dec 28, 2023

@easwars

  • Did you try disabling channel idleness? https://pkg.go.dev/google.golang.org/grpc#WithIdleTimeout
    No.
  • Do you have a way to reproduce the issue?
    Yes.
  • And does it always take two hours for the issue to happen?
    No, the time doesn't seem to be fixed; it has happened within half an hour.
  • Are you making unary or streaming RPCs?
    Unary.

@easwars
Contributor

easwars commented Dec 28, 2023

@itgcl : Would it be possible for you to share your repro? Thanks.


github-actions bot commented Jan 3, 2024

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jan 3, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 10, 2024
@realityone

We have a quite similar problem in Dragonfly2. The resolver fetches the latest nodes periodically; when the nodes have changed, it calls cc.UpdateState with the latest resolved addresses.

https://github.com/dragonflyoss/Dragonfly2/blob/abcf7ea9e4bd14dc43a08ac600763cf34cdd6e14/pkg/resolver/scheduler_resolver.go#L88

	addrs, err := r.dynconfig.GetResolveSchedulerAddrs()
	if err != nil {
		plogger.Errorf("resolve addresses error %v", err)
		return
	}

	if reflect.DeepEqual(r.addrs, addrs) {
		return
	}
	r.addrs = addrs

	if err := r.cc.UpdateState(resolver.State{
		Addresses: addrs,
	}); err != nil {
		plogger.Errorf("resolver update ClientConn error %v", err)
	}

But when the last gRPC connection is gone and a new gRPC connection is created, waitForResolvedAddrs may block forever: there is no further chance to call cc.UpdateState because all the nodes are the same as before, so the new connection gets stuck in waitForResolvedAddrs.
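
For illustration, a sketch (one reading of a possible workaround, not code from Dragonfly2) of a resolver that drops the reflect.DeepEqual short-circuit and re-sends the state on every refresh, so every ClientConn sees at least one update even when the address list has not changed; fetchAddrs is a hypothetical stand-in for the real lookup.

package schedulerresolver

import (
	"log"
	"time"

	"google.golang.org/grpc/resolver"
)

type periodicResolver struct {
	cc   resolver.ClientConn
	quit chan struct{}
}

// fetchAddrs is a hypothetical stand-in for dynconfig.GetResolveSchedulerAddrs.
func fetchAddrs() []resolver.Address {
	return []resolver.Address{{Addr: "scheduler-1:8002"}, {Addr: "scheduler-2:8002"}}
}

func (r *periodicResolver) watch() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		// Always push the current state instead of returning early when it
		// deep-equals the previous one, so cc.firstResolveEvent is guaranteed
		// to fire and waitForResolvedAddrs can proceed.
		if err := r.cc.UpdateState(resolver.State{Addresses: fetchAddrs()}); err != nil {
			log.Printf("resolver update ClientConn error %v", err)
		}
		select {
		case <-r.quit:
			return
		case <-ticker.C:
		}
	}
}

func (r *periodicResolver) ResolveNow(resolver.ResolveNowOptions) {}

func (r *periodicResolver) Close() { close(r.quit) }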

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 14, 2024