Defer hashing in staged ledger diff application #15980
base: compatible
Conversation
Problem: when performing transaction application, all Merkle ledger hashes are recomputed for every account update. This is wasteful for a number of reasons:

1. Transactions normally contain more than one account update, and hashes computed for one account update are overwritten by a subsequent account update.
2. The ledger's depth is 35, whereas only around 2^18 accounts are actually populated. This means that each account update induces a wasteful overhead of at least 17 hashes.

Solution: defer computation of hashes in the mask. When an account is added, it is pushed onto the mask's `unhashed_accounts` list, which is processed at the time of the next access to hashes.

This fix improved performance of handling 9-account-update transactions by ~60% (measured on a laptop).
Defer computation of account hashes to the moment the mask's hashes are actually accessed. This is useful when the same account is overwritten several times within the same block.
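The mechanism can be sketched as follows. This is a minimal, self-contained OCaml sketch with hypothetical, simplified types: integer locations and `Hashtbl.hash` stand in for `Location.t` and the real account hashing, and flat association lists stand in for the Merkle tree; only the deferral pattern itself is the point.

```ocaml
(* Sketch of deferred hashing in a mask (hypothetical simplified types). *)
type account = { balance : int }

type t =
  { mutable accounts : (int * account) list (* location -> account *)
  ; mutable hashes : (int * int) list (* location -> hash *)
  ; mutable unhashed_accounts : (account option * int) list (* newest first *)
  }

let create () = { accounts = []; hashes = []; unhashed_accounts = [] }

let hash_account (a : account) = Hashtbl.hash a

(* Writing an account records it as dirty instead of hashing immediately. *)
let set_account t loc acct =
  t.accounts <- (loc, acct) :: List.remove_assoc loc t.accounts ;
  t.unhashed_accounts <- (Some acct, loc) :: t.unhashed_accounts

(* Flush pending work; the list is cleared before processing. Since the
   pending list is newest-first, only the newest write per location is
   hashed; older writes to the same location are skipped. *)
let finalize_hashes t =
  let pending = t.unhashed_accounts in
  t.unhashed_accounts <- [] ;
  let seen = Hashtbl.create 16 in
  List.iter
    (fun (acct, loc) ->
      if not (Hashtbl.mem seen loc) then begin
        Hashtbl.add seen loc () ;
        match acct with
        | Some a ->
            t.hashes <-
              (loc, hash_account a) :: List.remove_assoc loc t.hashes
        | None -> ()
      end)
    pending

(* Hashes are materialized only when actually read. *)
let get_hash t loc =
  finalize_hashes t ;
  List.assoc_opt loc t.hashes
```

With this shape, two updates to the same account in one block cost one hash instead of two, which is where the speed-up in the PR description comes from.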
After spending >1.5h in total looking into this PR:
- It seems like a cool PR / idea to start with.
- I have no way of judging if this change is as you intended (unfortunately). This is a dense module describing a non-trivial data structure with close-to-zero documentation. On the positive side, I think I figured out the general sentiment of what you tried to achieve, which is already good.
- I'd like to see tests for this module
- I'd like to see explicit type annotations AND comments for most internal functions (think >5-10 lines of code; small helpers can be skipped), both those added in this PR and those moved around. Reading this for the first time, trying to guess what the functions are doing without comments/types, is hard.
(offtop rant: `open module T.S; module T = T.T; let type t = t; let type a x = t.a x; let rec impl a t = impl T.impl go a`. 2-space indentation (really hard to tell where code blocks end), 80-column wrapping, no explicit type annotations, almost no comments. This codebase is impossible to review without opening a whole IDE. I hope we're moving towards a more reader-friendly code style?.. 💀)
    in
    snd @@ List.fold_map ~init:all_parent_paths ~f self_paths

  let rec self_path_impl ~element ~depth address =
Note to myself: unchanged.
    let%map.Option rest = self_path_impl ~element ~depth parent_address in
    el :: rest

  let empty_hash =
Note to myself: unchanged.
  let empty_hash =
    Empty_hashes.extensible_cache (module Hash) ~init_hash:Hash.empty_account

  let self_path_get_hash ~hashes ~current_location height address =
Note to myself: unchanged.
  *)
  type unhashed_account_t = Account.t option * Location.t

  let sexp_of_unhashed_account_t =
Why do we need these functions? They seem to be unused.
@@ -93,6 +105,7 @@ module Make (Inputs : Inputs_intf.S) = struct
       This is used as a lookup cache. *)
    ; mutable accumulated : (accumulated_t[@sexp.opaque]) option
    ; mutable is_committing : bool
    ; mutable unhashed_accounts : unhashed_account_t list
Why not keep a hashmap here from position to the account value? Again wondering what happens if we update A1 to A2 to A3 in a batch of 2 txs: will both A2 and A3 appear in `unhashed_accounts`?
  let set_inner_hash_at_addr_exn t address hash =
    assert_is_attached t ;
    assert (Addr.depth address <= t.depth) ;
    self_set_hash t address hash

  let hashes_and_ancestor t =
Note to myself: mostly unchanged except for the first line. `maps_and_ancestor` was left as it is, and `hashes_and_ancestor` additionally runs `finalize_hashes` before the sub-call.
  let finalize_hashes t =
    let unhashed_accounts = t.unhashed_accounts in
    if not @@ List.is_empty unhashed_accounts then (
      t.unhashed_accounts <- [] ;
Is it important to first nullify the existing `unhashed_accounts` before calling `finalize_hashes_do`?... Would the other way around work? It's confusing because `finalize_hashes_do` also takes `t` as an argument, so technically you pass `unhashed_accounts` into it twice. `finalize_hashes_do` could just read/nullify these internally.
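One reason to clear the field before processing can be shown with a toy sketch (hypothetical names, not the PR's actual code): if handling a pending entry can itself re-enter a path that flushes pending work, clearing the field first makes the nested call a no-op instead of processing the same batch twice.

```ocaml
(* Toy model: "pending" work is flushed into "processed". *)
type t = { mutable pending : int list; mutable processed : int list }

let rec finalize t =
  match t.pending with
  | [] -> ()
  | pending ->
      (* Clear before processing: a nested finalize sees [] and returns. *)
      t.pending <- [] ;
      List.iter
        (fun x ->
          (* Suppose handling x can re-enter finalize (e.g. via a hash
             read); with the field already cleared this is harmless. *)
          finalize t ;
          t.processed <- x :: t.processed)
        pending
```

Clearing after processing instead would make the nested call see the same batch and process it again (or loop), which may be why the PR nullifies first.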
    in
    let on_snd f (_, a) (_, b) = f a b in
    List.stable_sort ~compare:(on_snd Location.compare) unhashed_accounts
    |> List.remove_consecutive_duplicates ~which_to_keep:`First
What's happening here? You seem to only keep the first account for each duplicated location. I assume you want to keep the last one instead, that is, the last update that was pushed to the list? Or are we prepending to the list?
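For reference, here is a small demonstration of how prepending, stable sorting, and keep-`First` dedup interact. This is a plain-stdlib sketch: a hand-rolled equivalent stands in for Core's `List.remove_consecutive_duplicates ~which_to_keep:` `` `First ``, and integer locations stand in for `Location.t`. If updates are prepended (newest first), a stable sort groups each location with its newest update first, so keeping the `First` of each run of duplicates retains the latest write:

```ocaml
(* Stand-in for Core's List.remove_consecutive_duplicates
   ~which_to_keep:`First, using only the OCaml stdlib. *)
let rec dedup_keep_first ~equal = function
  | a :: b :: rest when equal a b -> dedup_keep_first ~equal (a :: rest)
  | a :: rest -> a :: dedup_keep_first ~equal rest
  | [] -> []

(* Updates as (location, value), prepended as they happen:
   A2 written first, then C2, then A3 — so the list is newest-first. *)
let updates = [ (2, "A3"); (1, "C2"); (2, "A2") ]

(* Stable sort by location keeps newest-first order within location 2. *)
let sorted =
  List.stable_sort (fun (l1, _) (l2, _) -> compare l1 l2) updates

(* Keeping the first of each duplicate run retains the latest write. *)
let final = dedup_keep_first ~equal:(fun (a, _) (b, _) -> a = b) sorted
```

Under the prepend assumption, `final` is `[(1, "C2"); (2, "A3")]`: keep-`First` is exactly keep-latest. If updates were appended instead, `` `Last `` would be needed.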
@@ -81,6 +81,18 @@ module Make (Inputs : Inputs_intf.S) = struct
       *)
    }

  (** Type for an account that wasn't yet hashed.
General question regarding your comment from the PR description:

"When an account is touched in a few transactions of the same block, it gets hashed a few times, whereas in fact only the final hash is truly needed"

I'm wondering if this does not change the original application logic. Assume tx1 changes A and B, and tx2 changes C and A. So A1 becomes A2 and then A3, B1 becomes B2, C1 becomes C2. Surely, after tx1 and tx2 the final account state for A is A3, but don't we care /at all/ about the intermediate Merkle tree? Can we just rehash the MT based on A3, B2, C2?

(I suspect yes, just double checking)
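On the question above: since a Merkle root is a pure function of the leaf values, the root computed once from the final account states is the same as the root obtained after rehashing at every intermediate update. A toy sketch (hypothetical: `Hashtbl.hash` stands in for the real hash, a small perfect tree stands in for the depth-35 sparse ledger):

```ocaml
(* Combine two child hashes into a parent hash. *)
let node h1 h2 = Hashtbl.hash (h1, h2)

(* Root of a non-empty list of leaf hashes, pairing level by level. *)
let rec root = function
  | [] -> failwith "empty tree"
  | [ h ] -> h
  | hs ->
      let rec pair = function
        | a :: b :: rest -> node a b :: pair rest
        | leftover -> leftover
      in
      root (pair hs)

let leaf s = Hashtbl.hash s

(* tx1: A1->A2, B1->B2; tx2: A2->A3, C1->C2. *)
let after_tx1 = [ leaf "A2"; leaf "B2"; leaf "C1"; leaf "D1" ]
let after_tx2 = [ leaf "A3"; leaf "B2"; leaf "C2"; leaf "D1" ]

(* Eager: also computes the intermediate root, which is then discarded. *)
let eager_root =
  ignore (root after_tx1) ;
  root after_tx2

(* Deferred: hash only once, from the final leaf states. *)
let deferred_root = root after_tx2
```

The two roots coincide because the intermediate root is never an input to the final one; it only matters if something external observes it mid-block (e.g. per-transaction ledger-hash checks), which is the caveat the reviewer is probing.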
    update_maps t ~f:(fun maps ->
        { maps with hashes = Map.set maps.hashes ~key:address ~data:hash } )

  let path_batch_impl ~fixup_path ~self_lookup ~base_lookup locations =
Note to myself: slightly rewritten to not have `t` as an argument, and to not pass `hashes` to the `self_lookup` parameter (it's passed/embedded into `self_lookup` on the caller level).
Problem: when performing transaction application, all Merkle ledger hashes are recomputed for every account update.

Solution: defer computation of hashes in the mask. When an account is added, it is pushed onto the mask's `unhashed_accounts` list, which is processed at the time of the next access to hashes. This fix improved performance of handling 9-account-update transactions by ~60% (measured by #14582 on top of #15979 and #15978).