bigquery: AppendRows eventually deadlocks / high rate of context deadline exceeded #9660
Another thing I notice is that we do have high rates of
Starting to look at this. An EOF on send should trigger us to mark the connection as unhealthy, so the next send that uses the connection should cause a reconnect. Possibly we're doing something unusual with a received response? Are there any other noteworthy events leading up to the spike in latency?

One way to get more information here might be to add a gRPC streaming interceptor that logs more detail about the stream interactions. I've got some old code that does this, but it likely needs to be refreshed. I'll work on that next.
Ah, forgot to update this ticket. I root-caused this issue last week with GCP Support. The issue was that we were sending a malformed proto descriptor and receiving the following error from upstream:
In our client code, we had the following:

```go
// Example proto
m := &myprotopackage.MyCompiledMessage{}
descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())
```

instead of this:

```go
m := &myprotopackage.MyCompiledMessage{}
// Normalize the descriptor to collect nested messages, fix up oneOfs, etc.
descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())
if err != nil {
	// TODO: Handle error.
}
```

Ultimately, we were sending batches of bad data to the backend; that would get rejected and lead to the EOF errors. We did not see the underlying bad proto error because we were not calling

Couple thoughts:
Thanks for the additional details. Detecting malformed DescriptorProto messages at instantiation time would require us to mimic the same descriptor validation and building logic in both the clients and the service, which is problematic. That said, I've opened a potential FR (internally b/333755890) against the service to expose some kind of validation RPC that would allow us to check compatibility before sending data.

The managedwriter package currently produces OpenCensus metrics (I've got feature work in flight to enable the same on the OpenTelemetry side), so that's another avenue for monitoring ingestion health. The current metrics/views can be seen in the package docs: https://pkg.go.dev/cloud.google.com/go/bigquery/storage/managedwriter#pkg-variables
Given there are no immediately actionable elements in this issue, I'm going to go ahead and close it out at this time. However, if you have more feedback, please let me know here or via another issue.
Client
Bigquery v1.60.0
Environment
GKE
Go Environment
go v1.22
Code
Expected behavior
We'd expect the AppendRows call to complete quickly and not eventually get stuck.
Actual behavior
Actual behavior is that AppendRows degrades and gets stuck; we need to restart the process to recover. The degradation sets in after a couple of hours.

Screenshots
Below is a screenshot of the latency of the AppendRows call.

Other Context
In the interim, we are restarting the process when we get alerted on this issue.
What's interesting is that this process is writing to another BigQuery stream in parallel and never hits this degradation issue. There should be a 1:1 relationship in how many AppendRows calls are being made to each stream: for each message processed from Pub/Sub, we write to both streams. The number of serialized rows appended per AppendRows call for that other stream is substantially larger (~1k to 5k rows), while each row is slightly smaller.

I suspect there is a remaining deadlock somewhere within managedwriter. I have Google Cloud Profiler enabled for this service, but I didn't see anything noticeable in the CPU / threads profile view. Another thing I tried was switching from a committed stream to the default stream, but that had no impact.
Let me know if there are any debugging logs I can provide. @shollyman