
pubsub: machine runs out of memory and crashes #3056

Closed
jeanbza opened this issue Mar 18, 2018 · 12 comments
Labels
api: pubsub (Issues related to the Pub/Sub API.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

@jeanbza

jeanbza commented Mar 18, 2018

Hello! I'm running a simple Pub/Sub consumer. The consumer receives a message, waits 5 seconds, and then acks it. Running this consumer against a subscription with ~600,000 messages causes my app to repeatedly crash as it runs out of memory. This is running on GKE with no special configuration, and I'm using the Java client defaults where possible. Possibly worth noting: I don't see any warning/error logs related to these restarts.

Is this expected? What can be done to prevent this?

Screenshots

[screenshot: screen shot 2018-03-17 at 5 07 41 pm]

[screenshot: screen shot 2018-03-17 at 5 11 10 pm]

Repro
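The repro gist isn't preserved in this copy of the issue. As a stand-in, here is a hypothetical sketch of the consumer described above (receive a message, wait 5 seconds, ack). MyReceiver is the name used later in this thread, the project/subscription IDs are placeholders, and the new Timer per message is the detail that turns out to matter below.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical reconstruction of the repro, not the original gist.
class MyReceiver implements MessageReceiver {
  @Override
  public void receiveMessage(PubsubMessage message, final AckReplyConsumer consumer) {
    // Wait 5 seconds, then ack. Note: a brand-new Timer (and thread) per message.
    new Timer().schedule(
        new TimerTask() {
          @Override
          public void run() {
            consumer.ack();
          }
        },
        5_000L);
  }
}

public class Repro {
  public static void main(String[] args) throws Exception {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholder IDs
    Subscriber subscriber = Subscriber.newBuilder(subscription, new MyReceiver()).build();
    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated();
  }
}
```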

@jeanbza
Author

jeanbza commented Mar 18, 2018

Just to add to this: it seems to happen under load conditions only. I'm unable at the moment to reproduce that load, so I can't double-check these results. Towards the end of next week I may be able to reproduce it, in which case I'll update this issue (if it hasn't been updated or resolved by then).

@jeanbza
Author

jeanbza commented Mar 18, 2018

I was able to repro the load and the crashing happened again. The counter in this screenshot is not the same counter as before (I removed the deployment entirely and made a brand new one, restarting the counter at 0):

[screenshot: screen shot 2018-03-17 at 5 44 18 pm]

Note the dips in CPU to 0.

Here are the associated "load" logs:

[screenshot: screen shot 2018-03-17 at 5 45 38 pm]

@pongad added the "api: pubsub" (Issues related to the Pub/Sub API.) label Mar 18, 2018
@pongad
Contributor

pongad commented Mar 19, 2018

I figured out the problem. By default, the Java client caps the amount of "pending bytes" rather than the number of pending messages. In my testing each message is small, so there were probably millions of messages in flight at any given time.

Now, the problem is that your receiveMessage implementation creates a new Timer for every message, and each Timer instance creates a new thread. So you were creating millions of threads on the machine. On my machine I wasn't even able to run ps to find out how much memory the Java process was consuming; the kernel was so busy it couldn't create a new process!

The fix: create only one Timer per MyReceiver instance and reuse it for every message.
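A minimal sketch of that fix, assuming a receiver shaped like the hypothetical one sketched under "Repro" above (only the receiver changes):

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.pubsub.v1.PubsubMessage;
import java.util.Timer;
import java.util.TimerTask;

// One Timer (one thread) per MyReceiver instance, reused for every message,
// instead of a new Timer per message.
class MyReceiver implements MessageReceiver {
  private final Timer timer = new Timer(/* isDaemon= */ true);

  @Override
  public void receiveMessage(PubsubMessage message, final AckReplyConsumer consumer) {
    timer.schedule(
        new TimerTask() {
          @Override
          public void run() {
            consumer.ack();
          }
        },
        5_000L);
  }
}
```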

@pongad closed this as completed Mar 19, 2018
@jeanbza
Author

jeanbza commented Mar 19, 2018

Doh - thanks @pongad, testing now

@jeanbza
Author

jeanbza commented Mar 21, 2018

Hi @pongad - I've refactored out the per-message Timer creation (see the updated gist) and am unfortunately still running into issues:

[screenshot: screen shot 2018-03-20 at 10 32 49 pm]

[screenshot: screen shot 2018-03-20 at 10 32 57 pm]

Especially worrying are "E SEVERE: LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information." and "E java.lang.OutOfMemoryError: Java heap space".

Any ideas on what's going on, or what else I can fix? Or do you reckon this is a library issue?

@jeanbza reopened this Mar 21, 2018
@pongad
Contributor

pongad commented Mar 23, 2018

Hmm.....

Could you add the following?

subscriberBuilder.setFlowControlSettings(FlowControlSettings.newBuilder().setMaxOutstandingElementCount(100L).build())

The default sets the max pending bytes to a percentage of available memory. It might be possible that the subscriber somehow thinks it has much more memory than it actually does.
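For reference, a sketch of how that setting plugs into the subscriber builder (subscription name and MyReceiver are placeholders from the sketches above; FlowControlSettings is com.google.api.gax.batching.FlowControlSettings):

```java
ProjectSubscriptionName subscription =
    ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholder IDs

Subscriber subscriber =
    Subscriber.newBuilder(subscription, new MyReceiver())
        // Cap outstanding messages by count instead of relying on the
        // byte-based default, which is derived from available memory.
        .setFlowControlSettings(
            FlowControlSettings.newBuilder()
                .setMaxOutstandingElementCount(100L)
                .build())
        .build();
subscriber.startAsync().awaitRunning();
```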

If this doesn't help, I think this is a problem with gRPC or Kubernetes (I'm unable to repro on my dev machine, will try to repro on GCE) and we should ask them for help.

@jeanbza
Author

jeanbza commented Mar 23, 2018

Hey @pongad - I've tried your suggestion. It removes the errors, which is 👍, but the ingestion pattern is quite weird:

[screenshot: screen shot 2018-03-23 at 1 33 06 pm]

As you can see, every 30 minutes it stops receiving messages for ~20 minutes. I know the streams are indeed cancelled every 30 minutes, but I would expect the reconnect to happen almost immediately, as it does in Go.

So the two takeaways here are:

  • What's up with this 20m not-receiving-messages problem?
  • Is there something up in GKE that makes Java think it has more memory than it does?

@pongad
Contributor

pongad commented Mar 27, 2018

What's up with this 20m not-receiving-messages problem?

I think this is the same problem I sent you an email about. Let's discuss there first.

Is there something up in GKE that makes Java think it has more memory than it does?

I'm not totally sure. The Subscriber class uses Runtime::maxMemory. Could it be that GKE creates the JVM with a lot more memory than it's actually prepared to give?

Sorry to keep giving you more work, but would it be possible to make your instance log what maxMemory returns?
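Something like this one-liner would do, printed to stdout so it shows up in the container logs:

```java
// Log the max heap the JVM thinks it has; the subscriber's byte-based
// flow-control default is derived from this value.
System.out.println("Runtime.maxMemory() = " + Runtime.getRuntime().maxMemory() + " bytes");
```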

@jeanbza
Author

jeanbza commented Apr 2, 2018

Hey @pongad, I get 940703744. Assuming that's bytes, that's just under 1 GB. What do you reckon is going on here - why is the program going above 1 GB of memory?

@jeanbza
Author

jeanbza commented Apr 10, 2018

ping @pongad - is this release blocking?

@pongad
Contributor

pongad commented Apr 11, 2018

Sorry I lost track of this. I think the fix is easy enough that we might as well consider it release blocking.

@jeanbza added the "type: feature request" ('Nice-to-have' improvement, new feature or different behavior or design.), "release blocking", and "priority: p1" (Important issue which blocks shipping the next release. Will be fixed prior to next release.) labels and removed the "priority: p1" label Apr 11, 2018
@pongad
Contributor

pongad commented Apr 12, 2018

This should have been fixed by #3147
