Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted #1158
Comments
UPD. I don't want to complicate this GitHub issue with different errors, but running the same cron job multiple times now gives me a different error. I don't know how it relates to the original one. Tell me if I should open another issue for it or not. Stack trace:
Currently I have a script that has reproduced this error after ~7 minutes, 3 times in a row (I will try more times, but it looks "consistently reproducible"). All the script does is open 16 streams in parallel, reading data from Datastore and saving it (streaming) into a gzipped file. This is when running on my local macOS machine.
|
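For illustration, a rough sketch of the kind of script described above, assuming @google-cloud/datastore and a hypothetical kind name 'MyKind'; this is not the author's actual code:

const { Datastore } = require('@google-cloud/datastore');
const { Transform, pipeline } = require('stream');
const zlib = require('zlib');
const fs = require('fs');

const datastore = new Datastore();

// Serialize each entity to one line of JSON so it can be gzipped as a byte stream.
const toNdjson = () =>
  new Transform({
    objectMode: true,
    transform(entity, _enc, cb) {
      cb(null, JSON.stringify(entity) + '\n');
    },
  });

// One Datastore read stream -> gzip -> file; the promise rejects if any stage errors.
function exportShard(i) {
  return new Promise((resolve, reject) => {
    pipeline(
      datastore.runQueryStream(datastore.createQuery('MyKind')),
      toNdjson(),
      zlib.createGzip(),
      fs.createWriteStream(`shard-${i}.jsonl.gz`),
      (err) => (err ? reject(err) : resolve())
    );
  });
}

// 16 streams in parallel, as in the comment above.
Promise.all(Array.from({ length: 16 }, (_, i) => exportShard(i))).catch((err) =>
  console.error('export failed:', err)
);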
OK, there are a couple of issues here. Tracking down the source of these errors may be tricky for the reasons you mentioned, but they are definitely the result of errors being sent by the server in some form. You also mention that you can't try/catch these errors. |
I'm confident that these errors can't be caught with try/catch. We have code like: try {
  await datastore.get(...)
} catch (err) {
  console.error('datastore error!', err)
} This try/catch never sees the error. It's the same situation as: try {
  setTimeout(() => { throw new Error('catch me if you can!') }, 500)
} catch (err) {
  // never reached
} AFAIK, nothing can catch such an error, so it'll become an uncaughtException. Or, is there a way to do an experiment, add some logging, to confirm where the error comes from? |
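As a point of reference, a minimal sketch (my assumptions, not from the comment above) of the process-level hooks where such asynchronously thrown errors and unhandled rejections end up, since a surrounding try/catch never sees them:

// Errors thrown from timer or socket callbacks bypass any enclosing try/catch
// and surface here instead.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception:', err);
});

// Promise rejections that no .catch()/await handles surface here.
process.on('unhandledRejection', (reason) => {
  console.error('unhandled rejection:', reason);
});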
Well, grpc doesn't throw that error. It just passes it along to the client (Datastore, in this case). I am fairly certain that Datastore propagates the error as a promise rejection, but either way, it's not a grpc issue if Datastore is propagating that error improperly. |
We started observing the same issue yesterday. We get a lot of these RESOURCE_EXHAUSTED errors. We have no error logs on the backend that could help us track down this issue. |
So far, our best guess has been that it is related to the number of concurrent requests we are making, but the numbers are not consistent. After enabling tracing and debug logging, at some point we receive the RESOURCE_EXHAUSTED error.
Here are some logs from around the time it starts failing.
|
These logs show that the HTTP/2 streams are being closed with RST_STREAM frames with error codes 2 (INTERNAL_ERROR) and 11 (ENHANCE_YOUR_CALM which translates to RESOURCE_EXHAUSTED). It is very likely (but not 100% guaranteed) that these error codes are being sent by the server and not generated on the client. Specifically, they are likely being generated by the server's HTTP/2 implementation, not gRPC. You mentioned that you are using grpc-go on the server, so you may have some luck filing a bug report with the grpc-go repository and including a link to this comment. |
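For context, a minimal sketch (not from the thread) of how the RST_STREAM code that closed a stream can be inspected with Node's built-in http2 module; the endpoint here is just a placeholder:

const http2 = require('http2');

// Placeholder endpoint; any HTTP/2 server works for illustration.
const session = http2.connect('https://www.google.com');
const req = session.request({ ':path': '/' });

req.on('close', () => {
  // rstCode 2 = INTERNAL_ERROR, 11 = ENHANCE_YOUR_CALM (surfaced by gRPC as RESOURCE_EXHAUSTED).
  console.log('stream closed, rstCode =', req.rstCode);
  session.close();
});
req.resume();
req.end();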
Ok, thank you for your help. |
After moving from Node 12 to Node 13, we no longer have the issue. The fix is scheduled for Node v10.17.1 and v12.13.2. |
Thank you for sharing those extra details. |
The error disappeared after I upgraded to a newer Node version. |
Thanks for the information. Does Google Cloud support Node.js 13 for App Engine deployment in production? Sorry for adding a different context to the original topic, but they are inter-linked. |
I've upgraded to a newer Node version as well. |
This would seem to relate to this issue as well: googleapis/nodejs-datastore#525. So to avoid the error, it basically comes down to finding the right batch size of concurrent requests to send at a time? |
Having this issue with Node 12.17, which supposedly includes the fix from nodejs/node#30684. |
Experiencing this as well. Interestingly, even after following the settings from interop_extra_test on the client,
and for good measure on the server too,
I still get the error. However, it appears that all the messages are getting processed. |
I'm seeing this error in my application's Electron main process. I just switched to Electron v11.0.0-beta.19, which runs on Node.js 12.18.3, but it doesn't help. Interestingly, I got into a scenario where the rate of request processing doesn't really matter; it seems there is just some kind of limit on the amount of processed data. |
I made a change in grpc-js version 1.2.0 that changed the timing of keepalive pings. That may address some of the problems reported here. |
Changing to grpc-js 1.2.0 didn't make a difference for me; however, I just switched to Electron 12 beta 1 (which includes Node.js 14) and the problem disappeared for me. |
This bug in grpc-js or in node has turned out to become a nightmare for us and it's affecting our customers. We are using firestore in combination with electron. As electron 12 is not stable yet (and crashes a lot), and because there are no working grpc-node prebuilt binaries for electron 10.1, 10.2, 10.3 and 11, we had to downgrade our app to electron 10.0.1, downgrade firebase to v7.13.2 (because that's the last version that relies on grpc and not grpc-js), and upgrade grpc in the @firebase/firestore dependencies to grpc v1.24.4 using yarn resolutions and npm shrinkwrap (because we use it in two projects). As we currently are forced to use electron 10.0.1, we are missing out on many fixes of the subsequent patch versions. This makes it very hard for us to provide a stable app version to our customers... |
We switched to |
I've been using the Electron 12 beta for a few months now and it's quite stable, including elimination of this bug. |
For us, the latest Electron 12 beta has multiple crashes and we can't rely on its general stability. However, we can confirm that this bug no longer occurs there. |
I'm encountering this in a gRPC client that makes roughly ~1M requests per day, currently using Node.js 10.x. When this happens, all open streams fail and seemingly cannot be recovered. As @kirillgroshkov stated, it's impossible (or at least really hard) to try/catch this. I'll report back here if/when I get to test with Node.js 14.x. |
@jacoscaz Any news? How did you solve it? |
Hi all. @Limule I did try using Node 14.x and, more recently, 16.x but I'm still getting these errors. However, I should note that moving to newer versions of Node did increase the uptime between crashes, going from ~1h15m to ~2h30m. We're not actively working on this ATM but monitoring the situation. |
PR #1666 has been published in grpc-js version 1.3.0. It adds a channel option, grpc-node.max_session_memory. |
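For client code, a hedged sketch of where such a channel option goes; grpc.Client here stands in for whatever generated stub is actually used:

const grpc = require('@grpc/grpc-js');

// Channel options are the third constructor argument of a grpc-js client
// (generated stubs extend grpc.Client, so the same applies to them).
// Requires grpc-js >= 1.3.0 for 'grpc-node.max_session_memory'.
const client = new grpc.Client('example.googleapis.com:443', grpc.credentials.createSsl(), {
  'grpc-node.max_session_memory': Number.MAX_SAFE_INTEGER,
});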
Had the same problem. Fixed by the following code: const server = new grpc.Server({
'grpc-node.max_session_memory': Number.MAX_SAFE_INTEGER
}); |
This solves the issue! Thanks @paulish ! |
The bug is related to this code: grpc-node/packages/grpc-js/src/subchannel.ts, lines 414 to 424 (commit 3f71020).
The default value there is 10 (MB). |
It seems to be working in my tests. Can you provide a complete example that demonstrates the problem you are experiencing? |
I cannot provide an example because it's internal code from my company. These are the gRPC options we use: const grpcOptions = {
// Reconnection
"grpc.initial_reconnect_backoff_ms": 1500,
"grpc.min_reconnect_backoff_ms": 1000,
"grpc.max_reconnect_backoff_ms": 5000,
// Keepalive
"grpc.http2.min_time_between_pings_ms": 5 * 60000, // grpc default
"grpc.http2.max_pings_without_data": 0,
"grpc.keepalive_permit_without_calls": 1,
"grpc.keepalive_time_ms": 10 * 1000,
"grpc.keepalive_timeout_ms": 20 * 1000,
"grpc.max_connection_idle_ms": 0,
// limits
"grpc-node.max_session_memory": Number.MAX_SAFE_INTEGER
} |
Please verify which version of the library you are using. That option was added in version 1.3.0, but the default was not set to Number.MAX_SAFE_INTEGER. |
Using:
|
OK, that is the right version. Well, if you can provide a code sample that has different behavior when that option is set, that would help me investigate. |
It seems that I have a similar problem. I have tested setting that option, and I may have an explanation for this issue. In the subchannel code (https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L414)
there is one default value for the session memory, while the server code (https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/server.ts#L345) uses a different one.
That may be why injecting grpc-node.max_session_memory does not behave the same on both sides. |
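For background (my summary, not the commenter's words): 'grpc-node.max_session_memory' is forwarded to the maxSessionMemory option of Node's underlying http2 session, whose built-in default is 10 (MB). A bare-http2 sketch of the setting, with a placeholder endpoint:

const http2 = require('http2');

// Node's http2 sessions cap per-session memory at 10 MB by default; raising
// maxSessionMemory is what 'grpc-node.max_session_memory' ultimately controls.
const session = http2.connect('https://example.com', {
  maxSessionMemory: Number.MAX_SAFE_INTEGER,
});
session.on('error', (err) => console.error(err));
session.close();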
Thank you for pointing that out. I will fix that inconsistency. |
I have published grpc-js 1.7.2 with a fix for that default-value inconsistency. |
13237: Update @grpc/grpc-js r=Frassle a=Frassle — We've seen some issues from customers that might be due to grpc/grpc-node#1158, which looks like it should be better after 1.7.2. Co-authored-by: Fraser Waters <fraser@pulumi.com>
Since there have been no new comments since that fix went out, I assume this is resolved. |
Problem description
We use the @googlecloud/datastore dependency in our code. Since some version of grpc-js (currently we're on 0.6.9) we started to receive the following error in our production and staging backends, and also in our cron jobs that stream over ~100K / 1M records in Datastore (sometimes after ~5 minutes, sometimes after ~30 minutes). Error details as seen in our Sentry:

Reproduction steps
Very hard to give reproduction steps. The stack trace is not "async", in the sense that it doesn't link to the exact place in the code where the call was made (like it would have done with return await). We know that in the backend service we're doing all kinds of Datastore calls, but NOT stream. In cron jobs we DO stream, as well as make other (get, save) API calls.

Environment
@grpc/grpc-js@0.6.9
We definitely did NOT see this error in 0.5.x, but I don't remember exactly since which version of 0.6.x it started to appear.

Additional context
The error happens quite seldom, maybe ~1-2 times a day on a backend service that serves ~1M requests a day. But when it fails, it fails hard: it's impossible to try/catch such an error, and usually one "occurrence" of the error fails multiple requests from our clients. For example, last night it failed in our staging environment that was running e2e tests (many browsers open in parallel), which produced ~480 errors in one spike. So it looks like this error does not "recover the connection" very quickly.
Another annoying thing about this error is that if it happens inside a long-running cron job that streams some table, we have no way to recover from it, and the whole cron job becomes "failed in the middle" (imagine running a DB migration that fails in the middle in a non-transactional way). So, if our cron job needs to run for ~3 hours and fails after 2 hours, we have no choice but to restart it from the very beginning (paying all the Datastore costs).