Missing blocks due to high latency of maintenance block #504

abitmore · 2017-11-24T19:48:40Z

According to the design, there are much more work to be done at maintenance interval, so it's expected that latency of those maintenance blocks will be higher. It's exactly what happening in the production network, logs attached below. At every maintenance interval, 3 slots will be skipped, that means we have 12 seconds for a witness to generate a maintenance block and broadcast it to the next witness and the next witness should be able to validate it before generating the next block.

However, if the next witness didn't receive the maintenance block in 3 seconds, according to current implementation of witness plugin, she will consider that block as lost, so she will generate another maintenance block. This behavior is debatable.

if the block is really lost, it's good for the whole network if she generate another block, so the next witness will have 9 seconds (or 12?) to validate it and generate the next block. The total block interval will be shortest.
if the block really need 3 seconds to generate, her new block will likely reach the next witness later than the original one, so her block will be ignored, she will miss her block.
a worse scenario is that active witness list and/or witness schedule can change after the maintenance block, so the next witness after different maintenance block can be different, so it could cause forking (see this link for example).

Thoughts?

Related logs:

2017-11-23T18:00:03 th_a:Witness Block Production block_production_loo ] Generated block #22049281 with timestamp 2017-11-23T18:00:03 at time 2017-11-23T18:00:03                       witness.cpp:184
2017-11-23T18:00:03 th_a:invoke handle_block         handle_block ] Got block: #22049281 time: 2017-11-23T18:00:00 latency: 3267 ms from: verbaltech2  irreversible: 22049263 (-18)                     application.cpp:499
...
2017-11-23T19:00:00 th_a:invoke handle_block         handle_block ] Got block: #22050477 time: 2017-11-23T19:00:00 latency: 833 ms from: bhuz  irreversible: 22050458 (-19)                     application.cpp:499
...
2017-11-23T20:00:01 th_a:invoke handle_block         handle_block ] Got block: #22051674 time: 2017-11-23T20:00:00 latency: 1554 ms from: delegate.baozi  irreversible: 22051655 (-19)                  application.cpp:499
...
2017-11-23T21:00:01 th_a:invoke handle_block         handle_block ] Got block: #22052871 time: 2017-11-23T21:00:00 latency: 1235 ms from: wackou  irreversible: 22052855 (-16)                  application.cpp:499
...
2017-11-23T22:00:02 th_a:invoke handle_block         handle_block ] Got block: #22054068 time: 2017-11-23T22:00:00 latency: 2001 ms from: witness.still  irreversible: 22054049 (-19)                   application.cpp:499
...
2017-11-23T23:00:01 th_a:invoke handle_block         handle_block ] Got block: #22055265 time: 2017-11-23T23:00:00 latency: 1452 ms from: delegate.freedom  irreversible: 22055246 (-19)                        application.cpp:499
...
2017-11-24T00:00:03 th_a:invoke handle_block         handle_block ] Got block: #22056462 time: 2017-11-24T00:00:00 latency: 3104 ms from: verbaltech2  irreversible: 22056442 (-20)                     application.cpp:499
2017-11-24T00:00:03 th_a:invoke handle_block         handle_block ] Got block: #22056462 time: 2017-11-24T00:00:00 latency: 3372 ms from: verbaltech2  irreversible: 22056442 (-20)                     application.cpp:499
...
2017-11-24T01:00:01 th_a:invoke handle_block         handle_block ] Got block: #22057659 time: 2017-11-24T01:00:00 latency: 1214 ms from: xeldal  irreversible: 22057643 (-16)                  application.cpp:499
...
2017-11-24T02:00:02 th_a:invoke handle_block         handle_block ] Got block: #22058856 time: 2017-11-24T02:00:00 latency: 2005 ms from: witness.still  irreversible: 22058837 (-19)                   application.cpp:499
...
2017-11-24T03:00:01 th_a:invoke handle_block         handle_block ] Got block: #22060053 time: 2017-11-24T03:00:00 latency: 1040 ms from: blckchnd  irreversible: 22060036 (-17)                        application.cpp:499
...
2017-11-24T04:00:01 th_a:invoke handle_block         handle_block ] Got block: #22061250 time: 2017-11-24T04:00:00 latency: 1161 ms from: xn-delegate  irreversible: 22061232 (-18)                     application.cpp:499
...
2017-11-24T05:00:01 th_a:invoke handle_block         handle_block ] Got block: #22062447 time: 2017-11-24T05:00:00 latency: 1624 ms from: xman  irreversible: 22062431 (-16)                    application.cpp:499
...
2017-11-24T06:00:02 th_a:invoke handle_block         handle_block ] Got block: #22063644 time: 2017-11-24T06:00:00 latency: 2891 ms from: verbaltech2  irreversible: 22063629 (-15)                     application.cpp:499
2017-11-24T06:00:03 th_a:invoke handle_block         handle_block ] Got block: #22063644 time: 2017-11-24T06:00:00 latency: 3144 ms from: verbaltech2  irreversible: 22063630 (-14)                     application.cpp:499
...
2017-11-24T07:00:01 th_a:invoke handle_block         handle_block ] Got block: #22064840 time: 2017-11-24T07:00:00 latency: 1210 ms from: in.abit  irreversible: 22064823 (-17)                 application.cpp:499
...
2017-11-24T08:00:00 th_a:invoke handle_block         handle_block ] Got block: #22066037 time: 2017-11-24T08:00:00 latency: 867 ms from: roelandp  irreversible: 22066020 (-17)                 application.cpp:499
...
2017-11-24T09:00:01 th_a:invoke handle_block         handle_block ] Got block: #22067234 time: 2017-11-24T09:00:00 latency: 1563 ms from: abc123  irreversible: 22067216 (-18)                  application.cpp:499
2017-11-24T09:00:01 th_a:invoke handle_block         handle_block ] Got block: #22067234 time: 2017-11-24T09:00:00 latency: 1835 ms from: abc123  irreversible: 22067217 (-17)                  application.cpp:499
...
2017-11-24T10:00:01 th_a:invoke handle_block         handle_block ] Got block: #22068431 time: 2017-11-24T10:00:00 latency: 1061 ms from: roelandp  irreversible: 22068417 (-14)                        application.cpp:499
...
2017-11-24T11:00:01 th_a:invoke handle_block         handle_block ] Got block: #22069628 time: 2017-11-24T11:00:00 latency: 1617 ms from: delegate.baozi  irreversible: 22069611 (-17)                  application.cpp:499
...
2017-11-24T12:00:00 th_a:invoke handle_block         handle_block ] Got block: #22070825 time: 2017-11-24T12:00:00 latency: 960 ms from: roelandp  irreversible: 22070809 (-16)                 application.cpp:499
2017-11-24T12:00:01 th_a:invoke handle_block         handle_block ] Got block: #22070825 time: 2017-11-24T12:00:00 latency: 1230 ms from: roelandp  irreversible: 22070810 (-15)                        application.cpp:499
...
2017-11-24T13:00:03 th_a:Witness Block Production block_production_loo ] Generated block #22072022 with timestamp 2017-11-24T13:00:03 at time 2017-11-24T13:00:03                       witness.cpp:184
2017-11-24T13:00:03 th_a:invoke handle_block         handle_block ] Got block: #22072022 time: 2017-11-24T13:00:00 latency: 3269 ms from: verbaltech2  irreversible: 22072007 (-15)                     application.cpp:499
2017-11-24T13:00:03 th_a:invoke handle_block         handle_block ] Got block: #22072022 time: 2017-11-24T13:00:00 latency: 3269 ms from: verbaltech2  irreversible: 22072007 (-15)                     application.cpp:499
...
2017-11-24T14:00:00 th_a:invoke handle_block         handle_block ] Got block: #22073218 time: 2017-11-24T14:00:00 latency: 955 ms from: delegate-1.lafona  irreversible: 22073201 (-17)                        application.cpp:499
...
2017-11-24T15:00:00 th_a:invoke handle_block         handle_block ] Got block: #22074415 time: 2017-11-24T15:00:00 latency: 940 ms from: delegate-1.lafona  irreversible: 22074400 (-15)                        application.cpp:499
...
2017-11-24T16:00:00 th_a:invoke handle_block         handle_block ] Got block: #22075610 time: 2017-11-24T16:00:00 latency: 933 ms from: delegate-1.lafona  irreversible: 22075593 (-17)                        application.cpp:499
...
2017-11-24T17:00:01 th_a:invoke handle_block         handle_block ] Got block: #22076807 time: 2017-11-24T17:00:00 latency: 1162 ms from: in.abit  irreversible: 22076791 (-16)                 application.cpp:499
...
2017-11-24T18:00:01 th_a:invoke handle_block         handle_block ] Got block: #22078003 time: 2017-11-24T18:00:00 latency: 1081 ms from: delegate-1.lafona  irreversible: 22077987 (-16)                       application.cpp:499

The text was updated successfully, but these errors were encountered:

oxarbitrage · 2017-11-24T20:34:21Z

very interesting. let me ask you some stuff to understand better. is the 3 slots skipping a parameter that can be changed in case this blocks became more expensive or it is just happening because it is the time it is currently taking ? should we be be thinking on modifying witness plugin so the next witness wait 9 secs(or whatever) instead of 3 when we are at maintain time?
unsure about the technicals but maybe a flag of last block before next witness send maint block ?

on a side note there is an old issue at #211 that proposes plugins to add tasks to maintenance, something i think it will not be a good idea.

xeroc · 2017-11-24T21:34:16Z

Iirc, the number of blocks that can be skipped at most is a committee parameter and can thus be changed

oxarbitrage · 2017-11-24T21:49:44Z

thanks @xeroc found it here: https://github.com/bitshares/bitshares-core/blob/master/libraries/chain/include/graphene/chain/config.hpp#L51

abitmore · 2018-02-16T10:21:33Z

It's getting worse.

2018-02-16T10:00:04 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbf1a1c9e6a864b8e82f4d4d6a29481e0f1 time
: 2018-02-16T10:00:03 latency: 1711 ms from: roelandp  irreversible: 24472495 (-16)                       application.cpp:500
2018-02-16T10:00:05 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbf1a1c9e6a864b8e82f4d4d6a29481e0f1 time
: 2018-02-16T10:00:03 latency: 2330 ms from: roelandp  irreversible: 24472496 (-15)                       application.cpp:500
2018-02-16T10:00:07 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbfb2eef6e3a8ab42010a93ce87555178e5 time
: 2018-02-16T10:00:00 latency: 7453 ms from: delegate-zhaomu  irreversible: 24472496 (-15)                        application.cpp:500
2018-02-16T10:00:09 th_a:invoke handle_block         handle_block ] Got block: #24472511 01756bbfff6fbe4e527fecbd3c686f8110fc3155 time
: 2018-02-16T10:00:06 latency: 3613 ms from: witness.still  irreversible: 24472496 (-15)                  application.cpp:500
2018-02-16T10:00:18 th_a:invoke handle_block         handle_block ] Got block: #24472512 01756bc0a9a7083bd4fdb0545b5ed162ac78f23f time
: 2018-02-16T10:00:18 latency: 683 ms from: wackou  irreversible: 24472496 (-16)                  application.cpp:500

pmconrad · 2018-02-16T13:47:12Z

Hm, any idea what's causing this?
Perhaps need to add some timing output in various places during maintenance?
Or try to run a replay in profiling mode? That might take ages...

abitmore · 2018-02-16T20:28:36Z

This is a plugin issue, although the logic is tricky.

When the witness plugin didn't receive a block in 3 seconds then it tries to produce another block, actually this can be profiled, say, if it took itself 3 seconds to produce the new block, it should know the previous witness need 3 seconds to produce and some time to propagate, so the best guess should be it should have received the block now, so best to validate it (may need time as well) and build the next block based on it, but not simply broadcast its own maintenance block (then miss its turn).

Anyway, maintenance blocks are meant to need longer time to produce, we need to figure out the best way to handle them, which is independent from what's done in the maintenance interval. Above comment is a direction.

On the other hand, you've asked, why we now need so long time to produce the block? One simple reason is we have more and more accounts, so the node need longer time to iterate through them to re-tally votes. We may need to improve this mechanism at some time, one option could be learning from Steem (update votes on every block).

abitmore · 2018-02-19T22:05:32Z

Similar new log entries:

2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08ce4785cf6577a3202517b4bda4464e40c time: 2018-02-19T22:00:00 latency: 4013 ms from: witness.still  irreversible: 24572027 (-17)                  application.cpp:500
2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08ce4785cf6577a3202517b4bda4464e40c time: 2018-02-19T22:00:00 latency: 4584 ms from: witness.still  irreversible: 24572027 (-17)                  application.cpp:500
2018-02-19T22:00:04 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08c0b8aa7c3f267beb5846938b844d214e7 time: 2018-02-19T22:00:03 latency: 1843 ms from: abc123  irreversible: 24572027 (-17)                 application.cpp:500
2018-02-19T22:00:07 th_a:invoke handle_block         handle_block ] Got block: #24572044 0176f08c8bf1ba0cfe3e592fa75015908711cfed time: 2018-02-19T22:00:06 latency: 1868 ms from: sahkan-bitshares  irreversible: 24572027 (-17)                       application.cpp:500
2018-02-19T22:00:18 th_a:invoke handle_block         handle_block ] Got block: #24572045 0176f08d071472f054c0628dcf9bd6e17121d5a0 time: 2018-02-19T22:00:18 latency: 385 ms from: verbaltech2  irreversible: 24572027 (-18)                     application.cpp:500

abitmore · 2018-07-18T20:27:16Z

Now I tend to believe that current behavior / implementation is the best approach in the worst scenario, so likely won't change it.

I will close this issue if there is no new feedback.

abitmore · 2018-08-27T17:06:38Z

Copied @pmconrad's comment in #1157 (comment) to this issue since it's related:

We should define desirable behaviour before implementing fixes.

IMO, if a node hasn't received a block from its predecessor when its time slot has come, it should simply produce its own block in time.

Waiting until its own slot is nearly over is counterproductive, this will only propagate the problem to the next witness in line.

Producing a block on time is also desirable from the user's perspective, because that will get his transaction approved more quickly.

Not waiting for the previous block increases the chance of a fork between the previous block and the next. Waiting for the previous block increases the chance of a fork between this block and the next. So in the end waiting doesn't gain us anything (but increases the risk if the cause is not latency but witness failure).

There is one special case though: if the previous block has been received in time but takes a long time to apply (as is often the case in a maintenance block), then it makes sense to produce one even if our slot is nearing its end. I would extend the deadline in that special case only.

abitmore added the plugin label Nov 24, 2017

abitmore added the question label Feb 16, 2018

abitmore added the performance label Mar 2, 2018

abitmore mentioned this issue Apr 2, 2018

Improve vote tally performance #803

Closed

2 tasks

bangzi1001 mentioned this issue Jun 22, 2018

Make better use of multi-core servers #1079

Open

17 tasks

abitmore mentioned this issue Jul 29, 2018

Improve vote tally performance (long term) #1204

Open

1 task

abitmore mentioned this issue Aug 23, 2018

Check Steem issue 2658: Not producing block because node didn't wake up within 500ms of the slot time #1157

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing blocks due to high latency of maintenance block #504

Missing blocks due to high latency of maintenance block #504

abitmore commented Nov 24, 2017

oxarbitrage commented Nov 24, 2017

xeroc commented Nov 24, 2017

oxarbitrage commented Nov 24, 2017

abitmore commented Feb 16, 2018

pmconrad commented Feb 16, 2018

abitmore commented Feb 16, 2018

abitmore commented Feb 19, 2018

abitmore commented Jul 18, 2018

abitmore commented Aug 27, 2018

Missing blocks due to high latency of maintenance block #504

Missing blocks due to high latency of maintenance block #504

Comments

abitmore commented Nov 24, 2017

oxarbitrage commented Nov 24, 2017

xeroc commented Nov 24, 2017

oxarbitrage commented Nov 24, 2017

abitmore commented Feb 16, 2018

pmconrad commented Feb 16, 2018

abitmore commented Feb 16, 2018

abitmore commented Feb 19, 2018

abitmore commented Jul 18, 2018

abitmore commented Aug 27, 2018