Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Mega memory benchmarks #9858

Merged
merged 24 commits into from
Nov 14, 2024
Merged

feat: Mega memory benchmarks #9858

merged 24 commits into from
Nov 14, 2024

Conversation

codygunton
Copy link
Contributor

@codygunton codygunton commented Nov 8, 2024

It would be better to actually use Google Bench's memory manager functionality and count allocations. We already have something similar implemented for Tracy. After striking out with that approach for a bit I reverted to just manually counting the size of the biggest vectors.

The PR uncovered this issue: some trace structures have unusable capacity, not just due to using fewer than a dyadic number of gates, but also because of coupling of certain gate types AztecProtocol/barretenberg#1149

2024-11-14T16:49:17+00:00
Running ./bin/mega_memory_bench
Run on (48 X 2445.41 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x1)
Load Average: 0.11, 0.15, 0.19
ecc_op: 1014 / 1024
pub_inputs: 3990 / 4000
busread: 5988 / 6000
arithmetic: 199926 / 200000
delta_range: 24985 / 25000
elliptic: 79900 / 80000
aux: 99900 / 100000
poseidon2_external: 30118 / 30128
poseidon2_internal: 171990 / 172000
lookup: 199800 / 200000
++Estimating builder memory++
ecc_op size 464 KiB
pub_inputs size 0 KiB
busread size 3712 KiB
arithmetic size 118784 KiB
delta_range size 14848 KiB
elliptic size 59392 KiB
aux size 59392 KiB
poseidon2_external size 14848 KiB
poseidon2_internal size 118784 KiB
lookup size 118784 KiB
variables: 33554432
public inputs: 16384
variable indices: 16777216
Constructing DeciderProvingKey
WARNING: Redundant call to finalize_circuit(). Is this intentional?
Gate blocks summary: (actual gates / fixed capacity)
goblin ecc op :	1018/1024
pub inputs    :	0/4000 (populated in decider pk constructor)
busread       :	5991/6000
arithmetic    :	199928/200000
delta range   :	24987/25000
elliptic      :	79902/80000
auxiliary     :	99902/100000
poseidon ext  :	30120/30128
poseidon int  :	171992/172000
lookups       :	199802/200000
overflow      :	0/0

Finalized circuit size: 712724
Log dyadic circuit size: 20
constructing z_perm...
done constructing z_perm.
populating trace...
done populating trace.
time to construct proving key: 863 ms.
++Estimating proving key memory++
q_m num: 1048576 size: 32768 KiB
q_c num: 1048576 size: 32768 KiB
q_l num: 1048576 size: 32768 KiB
q_r num: 1048576 size: 32768 KiB
q_o num: 1048576 size: 32768 KiB
q_4 num: 1048576 size: 32768 KiB
q_busread num: 6000 size: 187 KiB
q_arith num: 405000 size: 12656 KiB
q_delta_range num: 25000 size: 781 KiB
q_elliptic num: 80000 size: 2500 KiB
q_aux num: 100000 size: 3125 KiB
q_poseidon2_external num: 30128 size: 941 KiB
q_poseidon2_internal num: 172000 size: 5375 KiB
q_lookup num: 200000 size: 6250 KiB
sigma_1 num: 1048576 size: 32768 KiB
sigma_2 num: 1048576 size: 32768 KiB
sigma_3 num: 1048576 size: 32768 KiB
sigma_4 num: 1048576 size: 32768 KiB
id_1 num: 1048576 size: 32768 KiB
id_2 num: 1048576 size: 32768 KiB
id_3 num: 1048576 size: 32768 KiB
id_4 num: 1048576 size: 32768 KiB
table_1 num: 70000 size: 2187 KiB
table_2 num: 70000 size: 2187 KiB
table_3 num: 70000 size: 2187 KiB
table_4 num: 70000 size: 2187 KiB
lagrange_first num: 1 size: 0 KiB
lagrange_last num: 1 size: 0 KiB
lagrange_ecc_op num: 1024 size: 32 KiB
databus_id num: 1048576 size: 32768 KiB
w_l num: 1048575 size: 32767 KiB
w_r num: 1048575 size: 32767 KiB
w_o num: 1048575 size: 32767 KiB
w_4 num: 1048575 size: 32767 KiB
z_perm num: 1048575 size: 32767 KiB
lookup_inverses num: 430423 size: 13450 KiB
lookup_read_counts num: 70000 size: 2187 KiB
lookup_read_tags num: 70000 size: 2187 KiB
ecc_op_wire_1 num: 1024 size: 32 KiB
ecc_op_wire_2 num: 1024 size: 32 KiB
ecc_op_wire_3 num: 1024 size: 32 KiB
ecc_op_wire_4 num: 1024 size: 32 KiB
calldata num: 10000 size: 312 KiB
calldata_read_counts num: 10000 size: 312 KiB
calldata_read_tags num: 10000 size: 312 KiB
calldata_inverses num: 11025 size: 344 KiB
secondary_calldata num: 10000 size: 312 KiB
secondary_calldata_read_counts num: 10000 size: 312 KiB
secondary_calldata_read_tags num: 10000 size: 312 KiB
secondary_calldata_inverses num: 11025 size: 344 KiB
return_data num: 10000 size: 312 KiB
return_data_read_counts num: 10000 size: 312 KiB
return_data_read_tags num: 10000 size: 312 KiB
return_data_inverses num: 11025 size: 344 KiB
table_1_shift num: 70000 size: 2187 KiB
table_2_shift num: 70000 size: 2187 KiB
table_3_shift num: 70000 size: 2187 KiB
table_4_shift num: 70000 size: 2187 KiB
w_l_shift num: 1048575 size: 32767 KiB
w_r_shift num: 1048575 size: 32767 KiB
w_o_shift num: 1048575 size: 32767 KiB
w_4_shift num: 1048575 size: 32767 KiB
z_perm_shift num: 1048575 size: 32767 KiB
-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
pk_mem/E2E_FULL_TEST/iterations:1           915 ms          878 ms            1 builder_mem_est=571.572M poly_mem_est=735.115M
ecc_op: 1014 / 1024
pub_inputs: 118 / 128
busread: 117 / 128
arithmetic: 197926 / 198000
delta_range: 51544 / 90000
elliptic: 8900 / 9000
aux: 135900 / 136000
poseidon2_external: 2490 / 2500
poseidon2_internal: 13990 / 14000
lookup: 71800 / 72000
++Estimating builder memory++
ecc_op size 464 KiB
pub_inputs size 0 KiB
busread size 58 KiB
arithmetic size 118784 KiB
delta_range size 29696 KiB
elliptic size 7424 KiB
aux size 118784 KiB
poseidon2_external size 1856 KiB
poseidon2_internal size 7424 KiB
lookup size 59392 KiB
variables: 33554432
public inputs: 512
variable indices: 16777216
Constructing DeciderProvingKey
WARNING: Redundant call to finalize_circuit(). Is this intentional?
Gate blocks summary: (actual gates / fixed capacity)
goblin ecc op :	1018/1024
pub inputs    :	0/128 (populated in decider pk constructor)
busread       :	120/128
arithmetic    :	197928/198000
delta range   :	51546/90000
elliptic      :	8902/9000
auxiliary     :	135902/136000
poseidon ext  :	2492/2500
poseidon int  :	13992/14000
lookups       :	71802/72000
overflow      :	0/0

Finalized circuit size: 346784
Log dyadic circuit size: 19
constructing z_perm...
done constructing z_perm.
populating trace...
done populating trace.
time to construct proving key: 312 ms.
++Estimating proving key memory++
q_m num: 524288 size: 16384 KiB
q_c num: 524288 size: 16384 KiB
q_l num: 524288 size: 16384 KiB
q_r num: 524288 size: 16384 KiB
q_o num: 524288 size: 16384 KiB
q_4 num: 524288 size: 16384 KiB
q_busread num: 128 size: 4 KiB
q_arith num: 433000 size: 13531 KiB
q_delta_range num: 90000 size: 2812 KiB
q_elliptic num: 9000 size: 281 KiB
q_aux num: 136000 size: 4250 KiB
q_poseidon2_external num: 2500 size: 78 KiB
q_poseidon2_internal num: 14000 size: 437 KiB
q_lookup num: 72000 size: 2250 KiB
sigma_1 num: 524288 size: 16384 KiB
sigma_2 num: 524288 size: 16384 KiB
sigma_3 num: 524288 size: 16384 KiB
sigma_4 num: 524288 size: 16384 KiB
id_1 num: 524288 size: 16384 KiB
id_2 num: 524288 size: 16384 KiB
id_3 num: 524288 size: 16384 KiB
id_4 num: 524288 size: 16384 KiB
table_1 num: 70000 size: 2187 KiB
table_2 num: 70000 size: 2187 KiB
table_3 num: 70000 size: 2187 KiB
table_4 num: 70000 size: 2187 KiB
lagrange_first num: 1 size: 0 KiB
lagrange_last num: 1 size: 0 KiB
lagrange_ecc_op num: 1024 size: 32 KiB
databus_id num: 524288 size: 16384 KiB
w_l num: 524287 size: 16383 KiB
w_r num: 524287 size: 16383 KiB
w_o num: 524287 size: 16383 KiB
w_4 num: 524287 size: 16383 KiB
z_perm num: 524287 size: 16383 KiB
lookup_inverses num: 73507 size: 2297 KiB
lookup_read_counts num: 70000 size: 2187 KiB
lookup_read_tags num: 70000 size: 2187 KiB
ecc_op_wire_1 num: 1024 size: 32 KiB
ecc_op_wire_2 num: 1024 size: 32 KiB
ecc_op_wire_3 num: 1024 size: 32 KiB
ecc_op_wire_4 num: 1024 size: 32 KiB
calldata num: 10000 size: 312 KiB
calldata_read_counts num: 10000 size: 312 KiB
calldata_read_tags num: 10000 size: 312 KiB
calldata_inverses num: 1281 size: 40 KiB
secondary_calldata num: 10000 size: 312 KiB
secondary_calldata_read_counts num: 10000 size: 312 KiB
secondary_calldata_read_tags num: 10000 size: 312 KiB
secondary_calldata_inverses num: 1281 size: 40 KiB
return_data num: 10000 size: 312 KiB
return_data_read_counts num: 10000 size: 312 KiB
return_data_read_tags num: 10000 size: 312 KiB
return_data_inverses num: 1281 size: 40 KiB
table_1_shift num: 70000 size: 2187 KiB
table_2_shift num: 70000 size: 2187 KiB
table_3_shift num: 70000 size: 2187 KiB
table_4_shift num: 70000 size: 2187 KiB
w_l_shift num: 524287 size: 16383 KiB
w_r_shift num: 524287 size: 16383 KiB
w_o_shift num: 524287 size: 16383 KiB
w_4_shift num: 524287 size: 16383 KiB
z_perm_shift num: 524287 size: 16383 KiB
pk_mem/CLIENT_IVC_BENCH/iterations:1        313 ms          296 ms            1 builder_mem_est=402.467M poly_mem_est=378.719M

@codygunton codygunton self-assigned this Nov 8, 2024
Copy link
Contributor

Changes to circuit sizes

Generated at commit: d6a8c9affaaf8410112d0df14051ec463738f33d, compared to commit: 5246251a7a91874e2af5f8788aa29533c3cf77fe

🧾 Summary (100% most significant diffs)

Program ACIR opcodes (+/-) % Circuit size (+/-) %
rollup_block_root -464 ✅ -10.35% -5,960 ✅ -0.21%
rollup_block_root_empty -32 ✅ -34.04% -32 ✅ -1.11%
rollup_block_merge -6,944 ✅ -57.86% -29,107 ✅ -1.50%
rollup_root -6,944 ✅ -57.94% -29,107 ✅ -1.50%
rollup_base_public -15,774 ✅ -3.36% -221,618 ✅ -5.88%
rollup_base_private -15,774 ✅ -4.74% -221,618 ✅ -6.46%
private_kernel_reset_4_4_4_4_4_4_4_4_1 -667 ✅ -2.06% -9,667 ✅ -11.16%
private_kernel_reset -10,767 ✅ -12.80% -155,407 ✅ -25.14%

Full diff report 👇
Program ACIR opcodes (+/-) % Circuit size (+/-) %
rollup_block_root 4,021 (-464) -10.35% 2,857,288 (-5,960) -0.21%
rollup_block_root_empty 62 (-32) -34.04% 2,859 (-32) -1.11%
rollup_block_merge 5,057 (-6,944) -57.86% 1,911,809 (-29,107) -1.50%
rollup_root 5,041 (-6,944) -57.94% 1,911,795 (-29,107) -1.50%
rollup_base_public 454,278 (-15,774) -3.36% 3,549,032 (-221,618) -5.88%
rollup_base_private 317,095 (-15,774) -4.74% 3,210,904 (-221,618) -6.46%
private_kernel_reset_4_4_4_4_4_4_4_4_1 31,789 (-667) -2.06% 76,935 (-9,667) -11.16%
private_kernel_reset 73,331 (-10,767) -12.80% 462,711 (-155,407) -25.14%

@codygunton codygunton changed the base branch from master to cg/revert-tbb-default-and-ivc-matching-regression-mac-build November 14, 2024 14:46
Base automatically changed from cg/revert-tbb-default-and-ivc-matching-regression-mac-build to master November 14, 2024 15:14
Copy link
Contributor

@ledwards2225 ledwards2225 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, couple of comments

fill_lookup_block(builder);

{
// finalize doesn't populate public inputs block, so copy to verify that the block is being filled well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this needs to be robustified. probably makes sense to just have the builder do it rather than populating a separate public_inputs object. There are a number of things that are more awkward than they need to be because we maintain this somewhat artificial distinction between the builder and the prover (e.g. structured trace is fundamentally a 'prover' notion). Maybe we should just do away with this division when it makes sense

Builder builder;
builder.blocks.set_fixed_block_sizes(settings);

fill_ecc_op_block(builder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see the need to populate these blocks with real data. In the structured case I think the actual block sizes are irrelevant because the fixed sizes are what determine memory. In the unstructured case it should only be about the size() of the blocks but even then it doesn't matter if we're storing zeros or something else. This would also make it easy to just populate the blocks with exactly the right size.

vinfo("++Estimating proving key memory++");
for (auto [polynomial, label] : zip_view(polynomials.get_all(), polynomials.get_labels())) {
uint64_t size = polynomial.size();
vinfo(label, " num: ", size, " size: ", (size * sizeof(FF)) >> 10, " KiB");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're not counting shifts in the total memory but I was a bit mislead by this print at first because it makes it look like shifts are utilizing their own mem

// alternative: add to finalize or add a flag to check whether PIs have already been populated
auto builder_copy = builder;
builder_copy.finalize_circuit(/* ensure_nonzero */ false);
DeciderProvingKey::Trace::populate_public_inputs_block(builder_copy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing. If you just construct a DeciderProvingKey from this builder_copy, it will now automatically compute the overflow / whether or not there is overflow. The print isn't as detailed as this one because I didn't add the labels (as you now have) but it could be made to be similar to what you have here (and probably should be for the sake of dev experience)

@codygunton codygunton merged commit 7e587d6 into master Nov 14, 2024
46 checks passed
@codygunton codygunton deleted the cg/memory-bench branch November 14, 2024 17:55
AztecBot pushed a commit to AztecProtocol/barretenberg that referenced this pull request Nov 15, 2024
It would be better to actually use Google Bench's memory manager
functionality and count allocations. We already have something similar
implemented for Tracy. After striking out with that approach for a bit I
reverted to just manually counting the size of the biggest vectors.

The PR uncovered this issue: some trace structures have unusable
capacity, not just due to using fewer than a dyadic number of gates, but
also because of coupling of certain gate types
#1149

See AztecProtocol/aztec-packages#9858 for logs of benchmarks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants