
consensus/clique: replace static 1/2 difficulties with dynamic 1-n scale #166

Merged 4 commits, May 11, 2018

Conversation

@jmank88 jmank88 commented May 8, 2018

DO NOT MERGE - REQUIRES RESET

This PR proposes replacing the static clique difficulties with dynamic, scaled values derived from the last signed block.

The existing clique consensus protocol uses two static difficulties, 2 for the 'in-turn' signer, and 1 for 'out-of-turn' signers. This prioritizes in-turn signing over out-of-turn. However, it does not distinguish between 'out-of-turn' signers, and 'in-turn' is based only on the block number, with no consideration of recent history. This leads to a few problems:

  1. When two or more nodes try to sign 'out-of-turn', there is no clear priority since they all have difficulty 1. This ambiguity seems to occasionally lead to a kind of split-decision logical deadlock in our testnet.
  2. When fewer than all n nodes are signing (x are down), a smooth n-x round-robin is likely not possible: when a node fills in out-of-turn, it may make itself ineligible (too recent) for its own next in-turn block, requiring another out-of-turn signature. This effect can cascade or repeat, depending on chance and the least common multiple of n and n-x.
  3. When a signer is added, the period of the 'in-turn' schedule changes, making it possible for nodes to be 'in-turn' for two consecutive blocks (or for two blocks too near each other).

These problems can be avoided by using a distinct, dynamic, scaled difficulty, based on the last block signed by each signer. From CalcDifficulty:

// Difficulty for ineligible signers (too recent) is always 0. For eligible signers, difficulty is defined as 1 plus the
// number of lower priority signers, with more recent signers having lower priority. If multiple signers have not yet
// signed (0), then addresses which sort lexicographically later have lower priority.

The most recent n/2 signers are ineligible to sign, so this produces difficulties from n/2+1 to n, inclusive, with the 'in-turn' signer always having difficulty n. This has several benefits which solve or reduce the aforementioned problems:

  1. Each signer always has a distinct difficulty, falling back to lexicographical sort when no blocks have been signed.
  2. Because the 'in-turn' signer is the node which signed least recently, when less than all n nodes are signing, a graceful n-x round-robin schedule will still be prioritized, with random out-of-turn signatures only shifting or reordering the schedule.
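The scheme above can be sketched as follows. This is a simplified model, not the actual go-ethereum code: string addresses stand in for common.Address, and lastSigned maps each authorized signer to the block it most recently signed (0 = never signed).

```go
package main

import "fmt"

// calcDifficulty sketches the proposed scheme. Ineligible signers (those who
// signed within the most recent n/2 blocks) get difficulty 0. Each eligible
// signer gets 1 plus the number of lower-priority signers, where having signed
// more recently means lower priority; among signers that have never signed,
// addresses that sort later lexicographically have lower priority.
func calcDifficulty(lastSigned map[string]uint64, signer string, number uint64) uint64 {
	signed := lastSigned[signer]
	limit := uint64(len(lastSigned)/2 + 1)
	if signed > 0 && number < signed+limit {
		return 0 // signed too recently: ineligible
	}
	diff := uint64(1)
	for addr, s := range lastSigned {
		if addr == signer {
			continue
		}
		if s > signed || (s == signed && addr > signer) {
			diff++ // addr has lower priority than signer
		}
	}
	return diff
}

func main() {
	// Four signers; A signed block 4, B block 1, C block 2, D block 3.
	last := map[string]uint64{"A": 4, "B": 1, "C": 2, "D": 3}
	for _, s := range []string{"A", "B", "C", "D"} {
		fmt.Printf("%s: %d\n", s, calcDifficulty(last, s, 5))
	}
	// A: 0 (too recent), B: 4 (in-turn), C: 3, D: 0 (too recent)
}
```

With all blocks signed, eligible difficulties for block 5 fall in [n/2+1, n] = [3, 4], with the least recent signer (B) in-turn at difficulty n.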

@jmank88 jmank88 force-pushed the scaled-difficulty branch from 9623df8 to 26eae68 Compare May 8, 2018 17:31
@@ -51,29 +52,27 @@ type Snapshot struct {

 	Number  uint64      `json:"number"` // Block number where the snapshot was created
 	Hash    common.Hash `json:"hash"`   // Block hash where the snapshot was created
-	Signers map[common.Address]struct{} `json:"signers"` // Set of authorized signers at this moment
+	Signers map[common.Address]uint64   `json:"signers"` // Each authorized signer at this moment and their most recently signed block
jmank88 (author):

This actually becomes simpler, since we drop Recents, but it may also be incompatible and require a chain reset.

	{signer: "A", voted: "C", auth: false},
	{signer: "B", voted: "C", auth: false},
	{signer: "A", voted: "B", auth: false},
	{signer: "A", voted: "D", auth: true},
jmank88 (author):

This test was identical to the previous. This is my best guess at the original intention, based on the comment.

 		Tally:   make(map[common.Address]Tally),
 	}
 	for _, signer := range signers {
-		snap.Signers[signer] = struct{}{}
+		snap.Signers[signer] = 0
jmank88 (author):

0 means 'no blocks signed' - we have to be sure to handle this specially and not interpret it as 'signed the genesis block', so that the initial n/2 blocks can be signed. I'm still not certain if it's important to assign distinct difficulties for those initial blocks, but it would be trivial to use the old algorithm.

A contributor replied:

Do we have a unit test for this case?

 )

 const (
 	checkpointInterval = 1024 // Number of blocks after which to save the vote snapshot to the database
 	inmemorySnapshots  = 128  // Number of recent vote snapshots to keep in memory
 	inmemorySignatures = 4096 // Number of recent block signatures to keep in memory

-	wiggleTime        = 500 * time.Millisecond // Random delay (per signer) to allow concurrent signers
+	recentSignerDelay = 1 * time.Second        // Full delay for most recent eligible signer.
jmank88 (author), May 8, 2018:

Raised since we are doing fractions of this value, instead of multiples (though perhaps this is unwise).

	// A difficulty <= limit would be too recent; limit+1 is the most recent eligible signer.
	// So by subtracting limit, limit+1 becomes 1, which is a full delay.
	fraction := diff - limit
	delay = recentSignerDelay / time.Duration(fraction)
jmank88 (author), May 8, 2018:

This looks off-by-one and I need to revisit it, but the general idea is to order these delays by difficulty, since they were random before. This fractional solution handles the scaled difficulties nicely, since the delay asymptotically approaches 0. However, I'm thinking that the maximum delay should still be based on the number of signers (like before), so the delays don't get so crammed together as we scale up. I'm not sure how important that is, though, since the distinct difficulties should resolve conflicts immediately.

jmank88 (author), May 8, 2018:

Duh, these only apply to diff < n, which is then shifted, so the scale isn't really relevant. I will rework this. We can distribute at most n/2 linearly into the range used before or something like that.

jmank88 (author), May 9, 2018:

Reworked into the much simpler: delay = time.Duration(n-diff) * wiggleTime, which is a maximum delay of n/10 seconds for the most recent eligible signer (with the current wiggleTime of 200ms).

@jmank88 jmank88 force-pushed the scaled-difficulty branch from 006fd8f to 9646f57 Compare May 9, 2018 13:15
 )

 const (
 	checkpointInterval = 1024 // Number of blocks after which to save the vote snapshot to the database
 	inmemorySnapshots  = 128  // Number of recent vote snapshots to keep in memory
 	inmemorySignatures = 4096 // Number of recent block signatures to keep in memory

-	wiggleTime = 500 * time.Millisecond // Random delay (per signer) to allow concurrent signers
+	wiggleTime = 200 * time.Millisecond // Delay step for out-of-turn signers.
jmank88 (author):

Restored to a multiplier, but with a reduced value since we have faster blocks.

@benbjohnson (Contributor) left a comment:

Overall this lgtm. I added mostly stylistic comments.

-	return errUnauthorized
+	signed, authorized := snap.Signers[signer]
+	if !authorized {
+		return fmt.Errorf("%s not authorized to sign", signer.Hex())
+	}
benbjohnson:

Can you change signed to something more clear? e.g. lastSignedBlockNumber. Right now signed seems like it would be a bool.

-	return nil, errUnauthorized
+	signed, authorized := snap.Signers[signer]
+	if !authorized {
+		return nil, fmt.Errorf("%s not authorized to sign", signer.Hex())
+	}
benbjohnson:

Update name of signed variable here too.

	if signed > 0 {
		limit := uint64(len(snap.Signers)/2 + 1)
		if next := limit + signed; number < next {
			return nil, fmt.Errorf("%s not authorized to sign %d: signed %d, next eligible signature %d", signer.Hex(), number, signed, next)
		}
benbjohnson:

Can we move the next calculation to snap so it's not duplicated and clearer? e.g. NextSignableBlockNumber(lastSignedBlockNumber uint64) uint64
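The suggested helper might look like the following sketch. It is hypothetical, with string addresses standing in for common.Address; the real Snapshot lives in consensus/clique.

```go
package main

import "fmt"

// Snapshot is a simplified stand-in: Signers maps each authorized signer to
// the block number it most recently signed (0 = never signed).
type Snapshot struct {
	Signers map[string]uint64
}

// NextSignableBlockNumber returns the first block number the signer is
// eligible to sign after lastSignedBlockNumber. The n/2 blocks following a
// signer's own signature are off limits; a signer that has never signed
// (lastSignedBlockNumber == 0) may sign any block.
func (s *Snapshot) NextSignableBlockNumber(lastSignedBlockNumber uint64) uint64 {
	if lastSignedBlockNumber == 0 {
		return 1
	}
	return lastSignedBlockNumber + uint64(len(s.Signers)/2+1)
}

func main() {
	snap := &Snapshot{Signers: map[string]uint64{"A": 0, "B": 0, "C": 0, "D": 0}}
	// After signing block 2 among 4 signers, blocks 3 and 4 are too recent.
	fmt.Println(snap.NextSignableBlockNumber(2)) // prints 5
}
```

Centralizing the calculation here would keep the eligibility check in both call sites (sealing and verification) in agreement.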

	for name, tt := range tests {
		t.Run(name, tt.run)
	}
}
benbjohnson:

Converting tests from a slice to a map is going to make the tests run in a different order every time. Not the biggest deal, but an alternative would be to put a name string field in the test struct definition.

jmank88 commented May 9, 2018

I'm beginning to think that letting the difficulty scale up unbounded could be problematic, but also that we can work around it and cap difficulty at 2n, essentially by mapping any value greater than n to n plus the node's index. We would still favor the least recent signer in the normal case and during small hiccups; in more irregular cases we'd be choosing from a logical set of 'least recent' signers, but I don't think that's a problem. This may also pair well with the policy for the initial n blocks after genesis.

jmank88 commented May 10, 2018

I've been resisting assigning simple difficulties from [1,n] since I thought it would require sorting (and possibly fetching) all the other signers, and the simplicity of diff = current - last was appealing and adequate. However, it turns out fetching and sorting aren't necessary (we already have all the signers, and can just iterate once and count), and the 'simpler' calculation has messy edge cases anyway. Instead, we can just assign difficulties from the range [1,n], corresponding to a sort order based on recency (and lexicographical order, for signers that have not signed yet). With this model, we keep a narrow range of sequential difficulties capped at n, and there are never any ambiguous, equal values.

I will update the OP.

@jmank88 jmank88 force-pushed the scaled-difficulty branch from 0415bcd to c9e765f Compare May 10, 2018 15:36
@jmank88 jmank88 changed the title WIP: consensus/clique: replace static 1/2 difficulties with distance from last signed consensus/clique: replace static 1/2 difficulties with distance from last signed May 10, 2018
@jmank88 jmank88 changed the title consensus/clique: replace static 1/2 difficulties with distance from last signed consensus/clique: replace static 1/2 difficulties with dynamic 1-n scale May 10, 2018
benbjohnson commented:

lgtm 👍

rlegene commented Jan 18, 2019

I am in favour of this code, though I need a way to fork my existing network into accepting a new block validation algorithm.

jmank88 commented Jan 21, 2019

@rlegene You can try this (arguably a bug in the client - background here) and this (less significant, just makes a random choice a little more deterministic) - they only adjust how to handle same-difficulty blocks, and don't modify the protocol at all, so existing clients can upgrade without a fork.
