Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pause timer calculation. #130

Merged
merged 10 commits into from
May 10, 2022
Merged

Fix pause timer calculation. #130

merged 10 commits into from
May 10, 2022

Conversation

heifner
Copy link
Member

@heifner heifner commented Apr 26, 2022

Fix for the long-standing issue with not correctly accounting for WASM load times. We pause the billing timer when loading WASMs. However, we were not using the correct deadline after restarting the paused timer which sometimes would cause the transaction to fail even though it had not actually run its allotted time.

Previous code:
deadline was the user provided deadline which was either the transaction deadline calculated from max-transaction-time or the wall clock block deadline passed in from the producer_plugin. When setting the timer again, after pausing, it would restrict the run time to the deadline even if the deadline was the original transaction deadline as opposed to the block deadline. This would restrict the amount of time the transaction was allowed to run on "first call" (when WASM loaded). The transaction could then fail because it ran "too long" even though it had not actually executed for max-transaction-time.

Now:
Node configurable wall clock block deadline (configurable via cpu-effort-percent) and node configurable max-transaction-time are now provided to the transaction processor (transaction_context) so when pausing for WASM load times, the wall clock block deadline can be honored separately from the max-transaction-time.

@heifner heifner requested a review from arhag April 26, 2022 00:46
Copy link
Contributor

@swatanabe swatanabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use more explanation. Even now, I don't understand what's wrong with the old code and how this change fixes it.

libraries/chain/transaction_context.cpp Outdated Show resolved Hide resolved
unittests/api_tests.cpp Show resolved Hide resolved
unittests/api_tests.cpp Show resolved Hide resolved
unittests/api_tests.cpp Outdated Show resolved Hide resolved
unittests/api_tests.cpp Outdated Show resolved Hide resolved
@heifner
Copy link
Member Author

heifner commented Apr 26, 2022

This could use more explanation. Even now, I don't understand what's wrong with the old code and how this change fixes it.

Previous code:
deadline was the user provided deadline which often is the transaction deadline passed in from the producer_plugin. When setting the timer again to this deadline it did not add in additional time for when it was paused. This restricts the amount of time the transaction is allowed to run on "first call" (when WASM loaded).

I added this to the description.

@swatanabe
Copy link
Contributor

Extending the deadline from producer_plugin is not correct. It's reasonable if the deadline is from --max-transaction-time, but not if it's because we need to ship out a block on time.

@heifner
Copy link
Member Author

heifner commented Apr 26, 2022

Extending the deadline from producer_plugin is not correct. It's reasonable if the deadline is from --max-transaction-time, but not if it's because we need to ship out a block on time.

That is why it checks block_cpu_deadline.

@swatanabe
Copy link
Contributor

block_cpu_deadline doesn't include the pauses from previous transactions does it?

@heifner
Copy link
Member Author

heifner commented Apr 26, 2022

block_cpu_deadline doesn't include the pauses from previous transactions does it?

It does not, it is the deadline calculated at the start of the transaction. We want to honor that deadline regardless how long we are paused. We want to make sure we ship a block out on time. The block_cpu_usage_exceeded will cause the producer_plugin to try the transaction in the next block.

@swatanabe
Copy link
Contributor

swatanabe commented Apr 26, 2022

I agree that respecting block_cpu_deadline is necessary, but the block deadline coming from producer plugin is more restrictive and we're discarding some of that by using only block_cpu_deadline. block_cpu_deadline is mostly objective, allowing only this transaction to decrease the available time subjectively. caller_deadline is a wall-clock time limit and includes all time since the start of the block, billed or not.

@heifner
Copy link
Member Author

heifner commented Apr 26, 2022

I agree that respecting block_cpu_deadline is necessary, but the block deadline coming from producer plugin is more restrictive and we're discarding some of that by using only block_cpu_deadline. block_cpu_deadline is mostly objective, allowing only this transaction to decrease it subjectively. caller_deadline is a wall-clock time limit and includes all time since the start of the block, billed or not.

Often the producer plugin deadline is less restrictive because it is wall clock and not block cpu, but you are correct that it could be more restrictive, especially if cpu effort is set low. I was trying to avoid passing into push_transaction an indication if this was a block deadline.

I do need to honor that however. I'll put in some more work on this. Thanks for your thorough review.

@swatanabe
Copy link
Contributor

Summary of the problem as I understand it:

  • transaction_checktime_timer manages two distinct deadlines, a wall clock deadline, and a billing clock deadline
  • the limit set by the command line option --max-transaction-time is incorrectly treated as a wall-clock limit.

@heifner
Copy link
Member Author

heifner commented Apr 26, 2022

Summary of the problem as I understand it:

  • transaction_checktime_timer manages two distinct deadlines, a wall clock deadline, and a billing clock deadline
  • the limit set by the command line option --max-transaction-time is incorrectly treated as a wall-clock limit.

Nicely stated. I'll look at this some more, but maybe the correct thing would be to provide push_transaction with the max-transaction-time and a block deadline and then let transaction_context do all this calculation. This would simplify the producer_plugin and move all this into transaction_context.

heifner added 3 commits April 29, 2022 20:16
…or the wall clock block deadline. Pass both block_deadline and max_transaction_time to transaction_context so it can honor both of these correctly.
@heifner heifner requested a review from swatanabe May 4, 2022 17:56
libraries/chain/transaction_context.cpp Outdated Show resolved Hide resolved
libraries/chain/transaction_context.cpp Outdated Show resolved Hide resolved
libraries/chain/transaction_context.cpp Show resolved Hide resolved
unittests/api_tests.cpp Outdated Show resolved Hide resolved
unittests/api_tests.cpp Show resolved Hide resolved
BOOST_CHECK( cpu_limit <= 150'000 );

// Verify that the restriction on the transaction of 24'999 is honored even though there is wall clock time to
// load the wasm. If this test fails it is possible that the wasm loaded faster or slower than expected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need two tests

  • The 24999 limit is extended by a pause (this check)
  • The block deadline is not extended by a pause (I don't see this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and one more:

  • The cpu limit is is originally before the block deadline, but is extended past it by a pause.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests.

@heifner heifner requested a review from swatanabe May 7, 2022 20:19
// If this check fails but duration is >= 25'000 (previous check did not fail), then the check here is likely
// because it took longer than 2.5 ms for checktime to trigger, trace to be created, and to get to the now() call.
// Timer is paused during WASM load, so checktime will not trigger until after WASM is loaded.
BOOST_CHECK( dur < 150'000 ); // should not run to max_transaction_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too long and does not match the comment of 2.5 ms. For this test, I would expect to see:

  • The base timeout (25 ms) is an upper bound on the wasm load time
  • The expected duration is an upper bound on timeout + checktime overhead
  • The expected duration is a lower bound on timeout + WASM compile time (not satisfied)

A high value here means that this test will always pass even if the block deadline is extended.

Copy link
Contributor

@swatanabe swatanabe May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If random noise makes this test too flaky, then we might have to give up on testing this condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated this test into two. One which gives time for the WASM to load but is limited to block deadline. This test I tweaked the values and comments to more closely reflect this particular test. Turns out on my machine I was seeing WASM load times of 35ms.

@heifner heifner requested a review from swatanabe May 10, 2022 02:02
@@ -1000,8 +1000,8 @@ BOOST_AUTO_TEST_CASE(checktime_pause_max_trx_cpu_extended_test) { try {
// This assumes that loading the WASM takes at least 1.5 ms
// If this check fails but duration is >= 24'999 (previous check did not fail), then the check here is likely
// because WASM took less than 1.5 ms to load.
BOOST_CHECK( dur > 26'500 );
BOOST_CHECK( dur < 150'000 ); // Should not run to block_cpu_usage deadline
BOOST_CHECK_MESSAGE( dur > 26'500, "elapsed " << std::to_string(dur) << "us" );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BOOST_TEST will print dur automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

api_tests.cpp(1102): error: in "api_tests/checktime_pause_block_deadline_not_extended_test": check dur < 30'000 has failed It wasn't for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd. That's what I'd expect from BOOST_CHECK. BOOST_TEST is supposed to decompose the expression and has been the recommended macro to use for a long time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant the to_string is not needed. Removed the to_string calls.

Copy link
Contributor

@swatanabe swatanabe May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mandel/unittests/api_tests.cpp(1003): error: in "api_tests/checktime_pause_max_trx_cpu_extended_test": check dur < 26'500 has failed [28130 >= 26500]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BOOST_TEST( dur < 26'500 );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you get it to print that? Some special command line arg to unit_test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you said BOOST_TEST. Ah, need to switch to using that in the future. Thanks.

@heifner heifner merged commit 8cea166 into main May 10, 2022
@heifner heifner deleted the fix-pause-timer branch May 10, 2022 12:28
@heifner heifner added the OCI OCI working this issue... label May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI OCI working this issue...
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants