Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block propagation metrics. #2476

Merged
merged 3 commits into from
Oct 15, 2021
Merged

Block propagation metrics. #2476

merged 3 commits into from
Oct 15, 2021

Conversation

karknu
Copy link
Contributor

@karknu karknu commented Mar 11, 2021

Depends on IntersectMBO/ouroboros-network#2979

This PR implements 6 gauges and counters with data from the Blockfetch client.
All these gauges are derived from the first time we download a block for a given slot. Other blocks for that slot, or even the same block for that slot is ignored.

cardano_node_metrics_blockfetchclient_blockdelay_cdfOne, cardano_node_metrics_blockfetchclient_blockdelay_cdfThree, cardano_node_metrics_blockfetchclient_blockdelay_cdfFive
The gauges attempts to estimate the probability that we manage to download a block within 1s, 3s or 5s. 95% of blocks within 3s is a soft limit for a well operating node. 95% of all blocks within 5s is a hard limit for praos security.

cardano_node_metrics_blockfetchclient_blockdelay_s
This gauge estimated delay from when the block should have been forged until it was received by the local blockfetch client. This metric does not include the time it took for consensus/ledger to adopt the block (and it is possible that they never will adopt it).

cardano_node_metrics_blockfetchclient_blocksize
This gauge tracks the block size. This is the block size of the first time we downloaded a block for a given slot. due to forks it may not be the size of the block that ended up on the stable part of the chain. For that you need db-sync.

cardano_node_metrics_blockfetchclient_lateblocks
This counter tracks blocks that arrived more than 5s after they where supposed to have been forged.

@njd42
Copy link

njd42 commented Mar 11, 2021

I see the calculation of 'success counts' here (i.e <1, <2, <5s) - but not 'failure' counts - should we also capture the count of those that are >5s? That would be a measure (or at least could become one) of the "data diffusion not timely enough" hazard being armed.

Copy link

@njd42 njd42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great as a tracking mechanism to track the data-diffusion-timeliness hazard; if we were to also track the >=5s arrival times that would potentially give deeper insight into outlier issues (e.g. misconfigured block producers with excessive clock skew, etc);

cardano-node/src/Cardano/Tracing/Tracers.hs Outdated Show resolved Hide resolved
@karknu
Copy link
Contributor Author

karknu commented Mar 11, 2021

I've added a counter for tracking the number of blocks that arrive more than 5s late.

Copy link

@njd42 njd42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - accepting that the duration is a bit arbitrary and should be derived from chain details eventually; that's not critical at this point.

cdf2s <- STM.readTVar cdf2sVar
cdf5s <- STM.readTVar cdf5sVar
return (fromIntegral cdf1s / fromIntegral size, fromIntegral cdf2s / fromIntegral size,
fromIntegral cdf5s / fromIntegral size)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming that the overall fraction of blocks received is tracked elsewhere - i.e. here it is a fraction, somewhere else we know the total population size...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is tracked in slotMapVar which contains the k/2 latest blocks.

@karknu karknu force-pushed the karknu/blockdelay branch 4 times, most recently from 91d432d to 4500594 Compare March 16, 2021 14:54
@karknu karknu marked this pull request as ready for review March 16, 2021 15:20
Copy link
Contributor

@dcoutts dcoutts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm. Grumble. This seems quite expensive.

At minimum I think we should pull the CDF tracking out into a trace transformer so it's not all open-coded within the block fetch tracer.

We should stick TODOs on it to revisit this for efficiency once we have the improved metrics route.

The use of a finite map for a rolling buffer is overkill. At most it needs a FIFO queue. There are cheap implementations of that with O(1) amortised insert and delete.

return True
else return False
when fresh $
sendEKGDirectDouble ekgDirect "cardano.node.metrics.blockfetchclient.blockdelay.s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an EKG label metric rather than a guage metric (e.g. measured in microseconds or something)?

And do we really have to do an EKG store string lookup every time? 😬 Maybe that's just the way it is until we replace the EKG metrics reporting scheme.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubles can't be represented directly, so we have to dumb them down to strings -- gauges would be nicer, as they are integral.

Comment on lines 707 to 724
sendEKGDirectDouble ekgDirect "cardano.node.metrics.blockfetchclient.blockdelay.s"
$ realToFrac delay
sendEKGDirectDouble ekgDirect "cardano.node.metrics.blockfetchclient.blockdelay.cdf1s"
cdf1s
sendEKGDirectDouble ekgDirect "cardano.node.metrics.blockfetchclient.blockdelay.cdf2s"
cdf2s
sendEKGDirectDouble ekgDirect "cardano.node.metrics.blockfetchclient.blockdelay.cdf5s"
cdf5s
when (delay > 5) $
sendEKGDirectCounter ekgDirect "cardano.node.metrics.blockfetchclient.lateblocks"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's painful that this is so slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO added for when there is a faster way.

Copy link
Contributor

@deepfire deepfire Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcoutts, @karknu, would you expect the new APIs to provide you with a way to acquire the EKG object once, and then repeatedly post updates to it?

We only currently consider ways to improve the performance without paying the (synchronisation) cost of modifyMVar (or equivalents) -- but we're not exposing EKG objects directly. We still expect to be performing a namespace map lookup per post.

The reason we didn't go for direct EKG exposure is for two reasons:

  1. it would make for a less abstract, less uniform API
  2. we don't expect a single map lookup (which is what the new version should boil down to) to be expensive -- the CPU sync/RTS cost of modifyMVar won't be there

Would you consider such a design unacceptable and would you like, instead, to (as mentioned above) operate on an once-acquired EKG object directly?

slotMap <- STM.readTVar slotMapVar
case SMap.lookup slotNo slotMap of
Nothing -> do
let slotMap' = SMap.insert slotNo delay slotMap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that SMap here is the non-strict Data.Map. We probably want the Data.Map.Strict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched the SMap import to use Data.Map.Strict.

@karknu
Copy link
Contributor Author

karknu commented Mar 18, 2021

The use of a finite map for a rolling buffer is overkill. At most it needs a FIFO queue. There are cheap implementations of that with O(1) amortised insert and delete.

A finite size map is what is needed. Blocks don't arrive in order and we may attempt to fetch the same block more than once and in that case we only want to track the first to finish.

@dcoutts
Copy link
Contributor

dcoutts commented Mar 18, 2021

A finite size map is what is needed. Blocks don't arrive in order and we may attempt to fetch the same block more than once and in that case we only want to track the first to finish.

Fair enough. Then we'd want a priority queue.

@karknu
Copy link
Contributor Author

karknu commented Mar 18, 2021

Fair enough. Then we'd want a priority queue.

Agreed, a priority queue would work.

@karknu karknu force-pushed the karknu/blockdelay branch 2 times, most recently from 5894b89 to 60997d5 Compare March 25, 2021 12:40
@karknu karknu requested a review from dcoutts March 25, 2021 13:01
@karknu karknu force-pushed the karknu/blockdelay branch 2 times, most recently from 1b2ae57 to f6b0f00 Compare March 30, 2021 08:36
cardano-node/src/Cardano/Tracing/Tracers.hs Show resolved Hide resolved
cardano-node/src/Cardano/Tracing/Tracers.hs Show resolved Hide resolved
case SMap.lookup slotNo slotMap of
Nothing -> do
let slotMap' = SMap.insert slotNo delay slotMap
if SMap.size slotMap' > 1080 -- TODO k/2, should come from config file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open an issue for it, and add the issue number here.

Copy link
Contributor

@coot coot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments from me, other than that LGTM.

Do we have an issue (I'd prefer that over a TODO comment) that tracks:

We should stick TODOs on it to revisit this for efficiency once we have the improved metrics route.

@deepfire is metrics performance be improved by the new logging system?

cardano-node/src/Cardano/Tracing/Tracers.hs Outdated Show resolved Hide resolved
cardano-node/src/Cardano/Tracing/Tracers.hs Outdated Show resolved Hide resolved

decCdfs :: NominalDiffTime -> Int -> STM ()
decCdfs delay size =
decCdfCounter delay size cdf1sVar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicer indentation:

Suggested change
decCdfCounter delay size cdf1sVar
decCdfCounter delay size cdf1sVar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of the file uses two space indentation.

@deepfire
Copy link
Contributor

deepfire commented Sep 7, 2021

@deepfire is metrics performance be improved by the new logging system?

Yes, in two respects:

  1. We're only doing namespace manipulations once per reconfiguration, not per trace call
  2. We will either (depending on how our ongoing investigation concludes):
    2.1 Avoid using a lock to get the EKG object, as done currently
    2.2 Provide direct EKG object access, so you can post to it directly

In any case, we're currently working on benchmarking of the logging system -- and both systems will be directly comparable for some time (for as long as we don't remove the legacy one -- the transition period).

@karknu karknu force-pushed the karknu/blockdelay branch 2 times, most recently from a66659b to 1a8d6c9 Compare September 20, 2021 19:14
Copy link
Contributor

@dcoutts dcoutts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Calculate the CDF for 1s, 3s and 5s for the last 1080 blocks which
corresponds to about 6h for mainnet.
A gauge for block sizes seen over the blockfetch protocol.
@karknu
Copy link
Contributor Author

karknu commented Oct 15, 2021

bors r+

@iohk-bors
Copy link
Contributor

iohk-bors bot commented Oct 15, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants