
[PAN-2786] Stop Transaction Pool Queue from Growing Unbounded #1586

Merged

Conversation

AbdelStark
Contributor

@AbdelStark AbdelStark commented Jun 20, 2019

PR description

The time-based policy has been removed. A follow-up PR will be created to include a timeout in TransactionsMessageProcessor and to skip messages that have expired.

  • implement a custom bounded queue
  • use a time-based policy with a keep-alive configuration
  • implement the eviction process based on the policy
  • add metrics
  • expose a method in MonitoredExecutors to create a working queue with a maximum capacity
  • update EthScheduler to use a limited working queue for the txWorkerExecutor

Fixed Issue(s)

- use an `ArrayBlockingQueue` with a fixed size to limit the transaction task queue (see the sketch below)
- expose a method in `MonitoredExecutors` to create a working queue with a maximum capacity
- update `EthScheduler` to use a limited working queue for the `txWorkerExecutor`
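
Not the PR's actual `MonitoredExecutors` code, but a minimal sketch of the fix described above using plain `java.util.concurrent` types: a fixed-capacity `ArrayBlockingQueue` backing a `ThreadPoolExecutor`, so the task backlog cannot grow unbounded. The capacity of 10,000 and the pool sizes are illustrative assumptions, not values from this PR.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorSketch {

  public static void main(String[] args) {
    // Fixed-capacity queue: once it is full, new submissions are rejected
    // instead of letting the backlog grow without bound.
    BlockingQueue<Runnable> workQueue = new ArrayBlockingQueue<>(10_000);

    ThreadPoolExecutor txWorkerExecutor =
        new ThreadPoolExecutor(
            1,                     // core pool size (illustrative)
            4,                     // maximum pool size (illustrative)
            60L, TimeUnit.SECONDS, // idle-thread keep-alive
            workQueue,
            // AbortPolicy throws RejectedExecutionException when the queue is
            // full, rather than blocking the submitting (vertx/netty) thread.
            new ThreadPoolExecutor.AbortPolicy());

    txWorkerExecutor.execute(() -> System.out.println("processing transactions message"));
    txWorkerExecutor.shutdown();
  }
}
```
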
@ajsutton
Contributor

Wouldn’t this cause exceptions to be thrown, or the vertx/netty thread to block, once the executor queue fills up? We’d want to discard the oldest messages when the queue is full because they’re likely full of outdated transactions and we want to process the most recent.

We also want to ensure there’s good logging and metrics to identify when this happens, since losing transactions is a bad thing in a lot of cases.

Finally, the queue size of 128 is far too small. Take a look at the queue sizes and incoming message rate for the MainNet Fast i3.large box: it gets up to 500 incoming messages/sec and only hits trouble when the queue size is in the millions. We probably don’t want to let it get that big, as the messages are likely out of date by then anyway, but it needs to be a few thousand at least, I’d say. Having an idea of how long messages were in that queue would help to set the size appropriately.

@AbdelStark
Contributor Author

> Wouldn’t this cause exceptions to be thrown, or the vertx/netty thread to block, once the executor queue fills up? We’d want to discard the oldest messages when the queue is full because they’re likely full of outdated transactions and we want to process the most recent.
>
> We also want to ensure there’s good logging and metrics to identify when this happens, since losing transactions is a bad thing in a lot of cases.
>
> Finally, the queue size of 128 is far too small. Take a look at the queue sizes and incoming message rate for the MainNet Fast i3.large box: it gets up to 500 incoming messages/sec and only hits trouble when the queue size is in the millions. We probably don’t want to let it get that big, as the messages are likely out of date by then anyway, but it needs to be a few thousand at least, I’d say. Having an idea of how long messages were in that queue would help to set the size appropriately.

Yes, indeed it will cause exceptions to be thrown once the executor queue fills up.
OK on the default size being too small; I will increase it to a few thousand. A next step will be to make this number configurable in a follow-up PR.
Regarding discarding the oldest messages, where is the best place to put this logic: in `EthScheduler` or in `TransactionsMessageHandler`? If we do this in `TransactionsMessageHandler`, it will require exposing the list of pending tasks from `EthScheduler`, and we would probably have to use a custom `BlockingQueue` to implement the logic based on the timestamp.

@ajsutton
Contributor

I’d say use a bounded queue with a really big size (a million or two), include the timestamp of when the message was first received in the queued task, and then, when the executor runs the task, the first thing it does is check whether it’s too old and bail out if so.

The speed difference between checking a timestamp and actually validating a transaction is so great that this mechanism will allow catching up on backlogs very quickly without anything too complex. And if for some reason we’re getting so many incoming messages that we can’t even do a long comparison on each fast enough, then the client has no hope: we’ll hit the queue size limit and throw loads of exceptions. The client is unlikely to cope at that point, but at least we have a lot of error messages telling us why. Definitely worth checking that it throws an exception if the queue is full instead of blocking.
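
A minimal sketch of the timestamp-and-bail approach just described; the `TimestampedTask` wrapper and its `maxAge` parameter are hypothetical names introduced for illustration, not code from this PR.

```java
import java.time.Duration;
import java.time.Instant;

/** Wraps a task with the time its message was received; skips it once stale. */
public class TimestampedTask implements Runnable {

  private final Runnable delegate;
  private final Duration maxAge;
  private final Instant receivedAt = Instant.now(); // recorded at enqueue time

  public TimestampedTask(final Runnable delegate, final Duration maxAge) {
    this.delegate = delegate;
    this.maxAge = maxAge;
  }

  @Override
  public void run() {
    // A cheap time comparison before any expensive transaction validation,
    // so a backlog of stale messages can be drained very quickly.
    if (Duration.between(receivedAt, Instant.now()).compareTo(maxAge) > 0) {
      return; // too old: bail out (real code would log and record a metric)
    }
    delegate.run();
  }
}
```
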

But the idea of using a custom queue is possibly a good one. If we did that, it would be easy for it to detect that the size limit is reached and remove from the front of the list when adding an item to the end. Logging and monitoring could all be inside the custom queue implementation then too.
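
A minimal sketch of that drop-oldest custom queue, assuming a delegate `ArrayBlockingQueue`: when the size limit is reached, `add` evicts from the front until the new element is accepted (the loop mirrors the "remove element until the new one is accepted" commit further down). Class and method names are illustrative, not the PR's actual implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Bounded queue that discards the oldest entries to make room for new ones. */
public class DropOldestQueue<E> {

  private final BlockingQueue<E> delegate;

  public DropOldestQueue(final int capacity) {
    this.delegate = new ArrayBlockingQueue<>(capacity);
  }

  /** Adds an element, evicting the oldest entries until it is accepted. */
  public synchronized void add(final E element) {
    while (!delegate.offer(element)) {
      delegate.poll(); // discard the oldest entry; real code would log/count this
    }
  }

  public E take() throws InterruptedException {
    return delegate.take();
  }
}
```
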

AbdelStark and others added 5 commits June 21, 2019 09:03
- implement a custom bounded queue
- use a time based policy with keep alive configuration
- implement eviction process based on the policy
- add metrics
@AbdelStark
Contributor Author

> I’d say use a bounded queue with a really big size (a million or two), include the timestamp of when the message was first received in the queued task, and then, when the executor runs the task, the first thing it does is check whether it’s too old and bail out if so.
>
> The speed difference between checking a timestamp and actually validating a transaction is so great that this mechanism will allow catching up on backlogs very quickly without anything too complex. And if for some reason we’re getting so many incoming messages that we can’t even do a long comparison on each fast enough, then the client has no hope: we’ll hit the queue size limit and throw loads of exceptions. The client is unlikely to cope at that point, but at least we have a lot of error messages telling us why. Definitely worth checking that it throws an exception if the queue is full instead of blocking.
>
> But the idea of using a custom queue is possibly a good one. If we did that, it would be easy for it to detect that the size limit is reached and remove from the front of the list when adding an item to the end. Logging and monitoring could all be inside the custom queue implementation then too.

I have implemented a custom queue that uses a time-based policy when adding an element at full capacity. A keep-alive parameter is used to evict transactions that have been in the working queue for too long.
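
A minimal sketch of the keep-alive policy described here, with illustrative names rather than the PR's actual code: at full capacity, head entries older than the keep-alive window are evicted to make room. (As the PR description above notes, this time-based policy was later removed again.)

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Bounded queue that, when full, evicts entries older than a keep-alive period. */
public class KeepAliveQueue<E> {

  /** Pairs an element with the time it entered the queue. */
  private static final class Timed<T> {
    final T value;
    final long enqueuedAtMillis = System.currentTimeMillis();

    Timed(final T value) {
      this.value = value;
    }
  }

  private final BlockingQueue<Timed<E>> delegate;
  private final long keepAliveMillis;

  public KeepAliveQueue(final int capacity, final long keepAliveMillis) {
    this.delegate = new ArrayBlockingQueue<>(capacity);
    this.keepAliveMillis = keepAliveMillis;
  }

  /** Offers an element, evicting stale head entries only at full capacity. */
  public synchronized boolean offer(final E element) {
    Timed<E> head;
    while (delegate.remainingCapacity() == 0
        && (head = delegate.peek()) != null
        && System.currentTimeMillis() - head.enqueuedAtMillis > keepAliveMillis) {
      delegate.poll(); // evict an entry that outlived the keep-alive window
    }
    // Still rejected if the queue is full of entries younger than keep-alive.
    return delegate.offer(new Timed<>(element));
  }

  public E take() throws InterruptedException {
    return delegate.take().value;
  }
}
```
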

- use concrete class instead of interface
- change metric name to comply with global policy
- update unit test
- wrap `Runnable` into `scheduleTxWorkerTask`
- remove time based policy
- use raw `Runnable`
- make room for a new element at full capacity
AbdelStark and others added 4 commits June 25, 2019 09:03
- remove Mock class
- make logic more thread safe, avoid race condition
- remove element until the new one is accepted
…manager/EthScheduler.java

Co-Authored-By: Adrian Sutton <adrian@symphonious.net>
@AbdelStark AbdelStark requested a review from ajsutton June 25, 2019 07:35
Contributor

@ajsutton ajsutton left a comment


LGTM. Just a couple of little nits that would be good to tidy up. Thanks for your patience on this one, timezones have made it take a long time - sorry.

@AbdelStark
Contributor Author

> LGTM. Just a couple of little nits that would be good to tidy up. Thanks for your patience on this one, timezones have made it take a long time - sorry.

No worries. Thank you for the time you spent on it.

- use assertj assertions for better readability
- improve unit test
@AbdelStark AbdelStark deleted the feature/pan-2786-tx-pool-queue-bound branch August 23, 2019 16:16