
Defer hashing in staged ledger diff application #15980

Open · wants to merge 7 commits into base: compatible
Conversation

@georgeee georgeee (Member) commented Aug 26, 2024

Problem: when performing transaction application, all merkle ledger hashes are recomputed for every account update. This is wasteful for a number of reasons:

  1. Transactions normally contain more than one account update, and a hash computed for one account update is sometimes overwritten by a subsequent account update
  2. The ledger's depth is 35, whereas only around 2^18 records are actually populated. This means that each account update induces a wasteful overhead of at least 17 hashes
  3. When an account is touched in a few transactions of the same block, it gets hashed a few times, whereas in fact only the final hash is truly needed

Solution: defer the computation of hashes in the mask. When an account is added, it is pushed onto the mask's unhashed_accounts list, which is processed on the next access to hashes.

This fix improved performance of handling 9-account-update transactions by ~60% (measured by #14582 on top of #15979 and #15978).
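As an illustration only (not the PR's actual code), the deferred-hashing idea can be sketched in a few lines of OCaml. The names `mask`, `set_account`, `finalize_hashes`, and `merkle_root` here are simplified stand-ins for the real structures in the masking merkle tree module, and `Hashtbl.hash` stands in for the real Merkle hashing:

```ocaml
(* Simplified sketch: account writes are queued in [unhashed_accounts]
   and the (stand-in) root hash is recomputed only on first access. *)
type account = { balance : int }

type mask =
  { mutable accounts : (int * account) list (* location -> account *)
  ; mutable unhashed_accounts : (int * account) list (* newest first *)
  ; mutable root : int option (* None = cached hashes are stale *)
  }

let create () = { accounts = []; unhashed_accounts = []; root = None }

(* O(1): just queue the write and invalidate cached hashes. *)
let set_account m loc acc =
  m.unhashed_accounts <- (loc, acc) :: m.unhashed_accounts ;
  m.root <- None

(* Apply queued writes oldest-first so the latest write per location wins. *)
let finalize_hashes m =
  List.rev m.unhashed_accounts
  |> List.iter (fun (loc, acc) ->
         m.accounts <- (loc, acc) :: List.remove_assoc loc m.accounts ) ;
  m.unhashed_accounts <- []

(* Hash access triggers finalization; [Hashtbl.hash] over the sorted final
   accounts stands in for real Merkle hashing. *)
let merkle_root m =
  ( match m.root with
  | Some _ -> ()
  | None ->
      finalize_hashes m ;
      m.root <-
        Some
          (List.fold_left
             (fun h (loc, a) -> Hashtbl.hash (h, loc, a.balance))
             0
             (List.sort compare m.accounts) ) ) ;
  Option.get m.root
```

With this shape, several updates to the same account within a block cost a single hashing pass instead of one per update; only the hash accessor pays for hashing.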

Explain how you tested your changes:

Checklist:

  • Dependency versions are unchanged
    • Notify Velocity team if dependencies must change in CI
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? None

@georgeee georgeee requested a review from a team as a code owner August 26, 2024 12:48
@georgeee georgeee self-assigned this Aug 26, 2024
Problem: when performing transaction application, all merkle ledger
hashes are recomputed for every account update. This is wasteful for a
number of reasons:

  1. Transactions normally contain more than one account update and
     hashes computed by an account update are overwritten by a
     subsequent account update
  2. Ledger's depth is 35, whereas only around 2^18 records are actually
     populated. This means that each account update induces a wasteful
     overhead of at least 17 hashes

Solution: defer the computation of hashes in the mask. When an account is
added, it is pushed onto the mask's `unhashed_accounts` list, which is
processed on the next access to hashes.

This fix improved performance of handling 9-account-update transactions
by ~60% (measured on a laptop).
Defer computation of account hashes to the moment the hashes of the mask
will actually be accessed.

This is useful if the same account gets overwritten a few times in the
same block.
@georgeee georgeee force-pushed the georgeee/defer-hashing-in-staged-ledger-diff-application branch from 03e7683 to d8b44f3 Compare August 26, 2024 13:18
Base automatically changed from georgeee/do-not-recompute-hashes-on-ledger-commit to compatible August 26, 2024 17:23
@georgeee (Member, Author):

!ci-build-me

@volhovm volhovm (Member) left a comment


After spending >1.5h in total looking into this PR:

  • It seems like a cool PR / idea to start with.
  • I have no way of judging if this change is as you intended (unfortunately). This is a dense module describing a non-trivial data structure with close-to-zero documentation. On the positive side, I think I figured out the general sentiment of what you tried to achieve, which is already good.
  • I'd like to see tests for this module
  • I'd like to see explicit type annotations AND comments for most internal functions (think >5-10 lines of code; small helpers can be skipped), added in this PR and those moved around. Reading this for the first time trying to guess what the functions are doing without comments/types is hard.

(offtop rant: open module T.S; module T = T.T; let type t = t; let type a x = t.a x; let rec impl a t = impl T.impl go a. 2 step indentation (really hard to tell where code blocks end), 80 symbol wrapping, no explicit type annotations, almost no comments. This codebase is impossible to review without opening a whole IDE. I hope we're moving towards a more reader-friendly codestyle?.. 💀 )

in
snd @@ List.fold_map ~init:all_parent_paths ~f self_paths

let rec self_path_impl ~element ~depth address =

Note to myself: unchanged.

let%map.Option rest = self_path_impl ~element ~depth parent_address in
el :: rest

let empty_hash =

Note to myself: unchanged.

let empty_hash =
Empty_hashes.extensible_cache (module Hash) ~init_hash:Hash.empty_account

let self_path_get_hash ~hashes ~current_location height address =

Note to myself: unchanged.

*)
type unhashed_account_t = Account.t option * Location.t

let sexp_of_unhashed_account_t =

Why do we need these functions? They seem to be unused.

@@ -93,6 +105,7 @@ module Make (Inputs : Inputs_intf.S) = struct
This is used as a lookup cache. *)
; mutable accumulated : (accumulated_t[@sexp.opaque]) option
; mutable is_committing : bool
; mutable unhashed_accounts : unhashed_account_t list

Why not keep a hashmap here from position to the account value? Again wondering what happens if we update A1 to A2 to A3 in a batch of 2 txs: will both A2 and A3 appear in unhashed_accounts?

let set_inner_hash_at_addr_exn t address hash =
assert_is_attached t ;
assert (Addr.depth address <= t.depth) ;
self_set_hash t address hash

let hashes_and_ancestor t =

Note to myself: mostly unchanged except for the first line. maps_and_ancestor was left as it is, and hashes_and_ancestor additionally runs finalize_hashes before the sub-call.

let finalize_hashes t =
let unhashed_accounts = t.unhashed_accounts in
if not @@ List.is_empty unhashed_accounts then (
t.unhashed_accounts <- [] ;

Is it important to first nullify the existing unhashed_accounts before calling finalize_hashes_do?... Would the other way around work?

It's confusing because finalize_hashes_do also takes t as an argument, so you pass unhashed_accounts into it twice technically. finalize_hashes_do could just read/nullify these internally.
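One plausible reason for clearing the queue before processing (an assumption, not confirmed by the author) is re-entrancy: if processing an account itself touches hashes and re-enters finalization, an already-cleared queue makes the nested call a no-op instead of reprocessing or looping. A stand-alone sketch of that guard, with hypothetical names:

```ocaml
(* Hypothetical illustration: clearing the queue *before* processing makes
   re-entrant calls harmless; clearing it after would reprocess the same
   items on re-entry. *)
let queue = ref [ 1; 2; 3 ]

let processed = ref []

let rec finalize () =
  match !queue with
  | [] -> ()
  | items ->
      queue := [] ; (* clear first *)
      List.iter
        (fun i ->
          processed := i :: !processed ;
          (* processing may itself access hashes, re-entering [finalize];
             since the queue is already empty, this is a no-op *)
          finalize () )
        items
```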

in
let on_snd f (_, a) (_, b) = f a b in
List.stable_sort ~compare:(on_snd Location.compare) unhashed_accounts
|> List.remove_consecutive_duplicates ~which_to_keep:`First

What's happening here? You seem to only keep the first account for each duplicated location. I assume you want to keep the last one instead, that is the last update that was pushed to the list? Or are we prepending to the list?
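For what it's worth, the two readings can be checked mechanically. Assuming updates are prepended to the list (newest at the head), a stable sort on location preserves that newest-first order within each location, so keeping `First` keeps the latest write. A stdlib-only sketch, where `dedup_keep_first` is an illustrative stand-in for Core's `List.remove_consecutive_duplicates ~which_to_keep:`First`:

```ocaml
(* Stand-in for Core's [List.remove_consecutive_duplicates
   ~which_to_keep:`First]: of each run of equal elements, keep the first. *)
let rec dedup_keep_first equal = function
  | a :: b :: rest when equal a b -> dedup_keep_first equal (a :: rest)
  | a :: rest -> a :: dedup_keep_first equal rest
  | [] -> []

(* [updates] has the newest update at the head (prepend order). A stable
   sort on location keeps that newest-first order per location, so the
   first element of each run is the latest write. *)
let latest_per_location updates =
  List.stable_sort (fun (l1, _) (l2, _) -> compare l1 l2) updates
  |> dedup_keep_first (fun (l1, _) (l2, _) -> l1 = l2)
```

E.g. pushing A1, then A2, then A3 for location 1 yields the queue `[(1, "A3"); (1, "A2"); (1, "A1")]`, from which `latest_per_location` keeps `(1, "A3")`.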

@@ -81,6 +81,18 @@ module Make (Inputs : Inputs_intf.S) = struct
*)
}

(** Type for an account that wasn't yet hashed.

General question regarding your comment from the PR description:

When an account is touched in a few transactions of the same block, it gets hashed a few times, whereas in fact only the final hash is truly needed

I'm wondering if this does not change the original application logic. Assume tx1 changes A and B, and tx2 changes C and A. So A1 becomes A2 and then A3, B1 becomes B2, C1 becomes C2. Surely, after tx1 and tx2 the final account state for A is A3, but don't we /at all/ care about the intermediate Merkle tree? Can we just rehash the MT based on A3 B2 C2?

(I suspect yes, just double checking)
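A toy example backs up the "yes": a Merkle root is a pure function of the leaf values, so intermediate roots never feed into the final one. Sketch with a hypothetical 4-leaf tree (`Hashtbl.hash` standing in for the real hash, `root` and `set` are illustrative names):

```ocaml
let h2 a b = Hashtbl.hash (a, b)

(* Root of a fixed 4-leaf toy Merkle tree: a pure function of the leaves. *)
let root = function
  | [ a; b; c; d ] -> h2 (h2 a b) (h2 c d)
  | _ -> invalid_arg "root"

(* Replace leaf [i] with value [v]. *)
let set leaves i v = List.mapi (fun j x -> if i = j then v else x) leaves

let () =
  let l0 = [ "A1"; "B1"; "C1"; "D1" ] in
  (* Incremental: rehash after every update; intermediate roots are unused. *)
  let l1 = set l0 0 "A2" in
  let _r1 = root l1 in
  let l2 = set l1 0 "A3" in
  let _r2 = root l2 in
  let l3 = set l2 1 "B2" in
  (* Deferred: hash once over the final leaves; same root. *)
  assert (root l3 = root [ "A3"; "B2"; "C1"; "D1" ])
```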

update_maps t ~f:(fun maps ->
{ maps with hashes = Map.set maps.hashes ~key:address ~data:hash } )

let path_batch_impl ~fixup_path ~self_lookup ~base_lookup locations =

Note to myself: slightly rewritten to not have t as an argument, and to not pass hashes to the self_lookup parameter (it's passed/embedded into self_lookup on the caller level).
