feat: Mega memory benchmarks #9858

codygunton · 2024-11-08T20:00:42Z

It would be better to actually use Google Bench's memory manager functionality and count allocations. We already have something similar implemented for Tracy. After striking out with that approach for a bit I reverted to just manually counting the size of the biggest vectors.

The PR uncovered this issue: some trace structures have unusable capacity, not just due to using fewer than a dyadic number of gates, but also because of coupling of certain gate types AztecProtocol/barretenberg#1149

2024-11-14T16:49:17+00:00
Running ./bin/mega_memory_bench
Run on (48 X 2445.41 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x1)
Load Average: 0.11, 0.15, 0.19
ecc_op: 1014 / 1024
pub_inputs: 3990 / 4000
busread: 5988 / 6000
arithmetic: 199926 / 200000
delta_range: 24985 / 25000
elliptic: 79900 / 80000
aux: 99900 / 100000
poseidon2_external: 30118 / 30128
poseidon2_internal: 171990 / 172000
lookup: 199800 / 200000
++Estimating builder memory++
ecc_op size 464 KiB
pub_inputs size 0 KiB
busread size 3712 KiB
arithmetic size 118784 KiB
delta_range size 14848 KiB
elliptic size 59392 KiB
aux size 59392 KiB
poseidon2_external size 14848 KiB
poseidon2_internal size 118784 KiB
lookup size 118784 KiB
variables: 33554432
public inputs: 16384
variable indices: 16777216
Constructing DeciderProvingKey
WARNING: Redundant call to finalize_circuit(). Is this intentional?
Gate blocks summary: (actual gates / fixed capacity)
goblin ecc op :	1018/1024
pub inputs    :	0/4000 (populated in decider pk constructor)
busread       :	5991/6000
arithmetic    :	199928/200000
delta range   :	24987/25000
elliptic      :	79902/80000
auxiliary     :	99902/100000
poseidon ext  :	30120/30128
poseidon int  :	171992/172000
lookups       :	199802/200000
overflow      :	0/0

Finalized circuit size: 712724
Log dyadic circuit size: 20
constructing z_perm...
done constructing z_perm.
populating trace...
done populating trace.
time to construct proving key: 863 ms.
++Estimating proving key memory++
q_m num: 1048576 size: 32768 KiB
q_c num: 1048576 size: 32768 KiB
q_l num: 1048576 size: 32768 KiB
q_r num: 1048576 size: 32768 KiB
q_o num: 1048576 size: 32768 KiB
q_4 num: 1048576 size: 32768 KiB
q_busread num: 6000 size: 187 KiB
q_arith num: 405000 size: 12656 KiB
q_delta_range num: 25000 size: 781 KiB
q_elliptic num: 80000 size: 2500 KiB
q_aux num: 100000 size: 3125 KiB
q_poseidon2_external num: 30128 size: 941 KiB
q_poseidon2_internal num: 172000 size: 5375 KiB
q_lookup num: 200000 size: 6250 KiB
sigma_1 num: 1048576 size: 32768 KiB
sigma_2 num: 1048576 size: 32768 KiB
sigma_3 num: 1048576 size: 32768 KiB
sigma_4 num: 1048576 size: 32768 KiB
id_1 num: 1048576 size: 32768 KiB
id_2 num: 1048576 size: 32768 KiB
id_3 num: 1048576 size: 32768 KiB
id_4 num: 1048576 size: 32768 KiB
table_1 num: 70000 size: 2187 KiB
table_2 num: 70000 size: 2187 KiB
table_3 num: 70000 size: 2187 KiB
table_4 num: 70000 size: 2187 KiB
lagrange_first num: 1 size: 0 KiB
lagrange_last num: 1 size: 0 KiB
lagrange_ecc_op num: 1024 size: 32 KiB
databus_id num: 1048576 size: 32768 KiB
w_l num: 1048575 size: 32767 KiB
w_r num: 1048575 size: 32767 KiB
w_o num: 1048575 size: 32767 KiB
w_4 num: 1048575 size: 32767 KiB
z_perm num: 1048575 size: 32767 KiB
lookup_inverses num: 430423 size: 13450 KiB
lookup_read_counts num: 70000 size: 2187 KiB
lookup_read_tags num: 70000 size: 2187 KiB
ecc_op_wire_1 num: 1024 size: 32 KiB
ecc_op_wire_2 num: 1024 size: 32 KiB
ecc_op_wire_3 num: 1024 size: 32 KiB
ecc_op_wire_4 num: 1024 size: 32 KiB
calldata num: 10000 size: 312 KiB
calldata_read_counts num: 10000 size: 312 KiB
calldata_read_tags num: 10000 size: 312 KiB
calldata_inverses num: 11025 size: 344 KiB
secondary_calldata num: 10000 size: 312 KiB
secondary_calldata_read_counts num: 10000 size: 312 KiB
secondary_calldata_read_tags num: 10000 size: 312 KiB
secondary_calldata_inverses num: 11025 size: 344 KiB
return_data num: 10000 size: 312 KiB
return_data_read_counts num: 10000 size: 312 KiB
return_data_read_tags num: 10000 size: 312 KiB
return_data_inverses num: 11025 size: 344 KiB
table_1_shift num: 70000 size: 2187 KiB
table_2_shift num: 70000 size: 2187 KiB
table_3_shift num: 70000 size: 2187 KiB
table_4_shift num: 70000 size: 2187 KiB
w_l_shift num: 1048575 size: 32767 KiB
w_r_shift num: 1048575 size: 32767 KiB
w_o_shift num: 1048575 size: 32767 KiB
w_4_shift num: 1048575 size: 32767 KiB
z_perm_shift num: 1048575 size: 32767 KiB
-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
pk_mem/E2E_FULL_TEST/iterations:1           915 ms          878 ms            1 builder_mem_est=571.572M poly_mem_est=735.115M
ecc_op: 1014 / 1024
pub_inputs: 118 / 128
busread: 117 / 128
arithmetic: 197926 / 198000
delta_range: 51544 / 90000
elliptic: 8900 / 9000
aux: 135900 / 136000
poseidon2_external: 2490 / 2500
poseidon2_internal: 13990 / 14000
lookup: 71800 / 72000
++Estimating builder memory++
ecc_op size 464 KiB
pub_inputs size 0 KiB
busread size 58 KiB
arithmetic size 118784 KiB
delta_range size 29696 KiB
elliptic size 7424 KiB
aux size 118784 KiB
poseidon2_external size 1856 KiB
poseidon2_internal size 7424 KiB
lookup size 59392 KiB
variables: 33554432
public inputs: 512
variable indices: 16777216
Constructing DeciderProvingKey
WARNING: Redundant call to finalize_circuit(). Is this intentional?
Gate blocks summary: (actual gates / fixed capacity)
goblin ecc op :	1018/1024
pub inputs    :	0/128 (populated in decider pk constructor)
busread       :	120/128
arithmetic    :	197928/198000
delta range   :	51546/90000
elliptic      :	8902/9000
auxiliary     :	135902/136000
poseidon ext  :	2492/2500
poseidon int  :	13992/14000
lookups       :	71802/72000
overflow      :	0/0

Finalized circuit size: 346784
Log dyadic circuit size: 19
constructing z_perm...
done constructing z_perm.
populating trace...
done populating trace.
time to construct proving key: 312 ms.
++Estimating proving key memory++
q_m num: 524288 size: 16384 KiB
q_c num: 524288 size: 16384 KiB
q_l num: 524288 size: 16384 KiB
q_r num: 524288 size: 16384 KiB
q_o num: 524288 size: 16384 KiB
q_4 num: 524288 size: 16384 KiB
q_busread num: 128 size: 4 KiB
q_arith num: 433000 size: 13531 KiB
q_delta_range num: 90000 size: 2812 KiB
q_elliptic num: 9000 size: 281 KiB
q_aux num: 136000 size: 4250 KiB
q_poseidon2_external num: 2500 size: 78 KiB
q_poseidon2_internal num: 14000 size: 437 KiB
q_lookup num: 72000 size: 2250 KiB
sigma_1 num: 524288 size: 16384 KiB
sigma_2 num: 524288 size: 16384 KiB
sigma_3 num: 524288 size: 16384 KiB
sigma_4 num: 524288 size: 16384 KiB
id_1 num: 524288 size: 16384 KiB
id_2 num: 524288 size: 16384 KiB
id_3 num: 524288 size: 16384 KiB
id_4 num: 524288 size: 16384 KiB
table_1 num: 70000 size: 2187 KiB
table_2 num: 70000 size: 2187 KiB
table_3 num: 70000 size: 2187 KiB
table_4 num: 70000 size: 2187 KiB
lagrange_first num: 1 size: 0 KiB
lagrange_last num: 1 size: 0 KiB
lagrange_ecc_op num: 1024 size: 32 KiB
databus_id num: 524288 size: 16384 KiB
w_l num: 524287 size: 16383 KiB
w_r num: 524287 size: 16383 KiB
w_o num: 524287 size: 16383 KiB
w_4 num: 524287 size: 16383 KiB
z_perm num: 524287 size: 16383 KiB
lookup_inverses num: 73507 size: 2297 KiB
lookup_read_counts num: 70000 size: 2187 KiB
lookup_read_tags num: 70000 size: 2187 KiB
ecc_op_wire_1 num: 1024 size: 32 KiB
ecc_op_wire_2 num: 1024 size: 32 KiB
ecc_op_wire_3 num: 1024 size: 32 KiB
ecc_op_wire_4 num: 1024 size: 32 KiB
calldata num: 10000 size: 312 KiB
calldata_read_counts num: 10000 size: 312 KiB
calldata_read_tags num: 10000 size: 312 KiB
calldata_inverses num: 1281 size: 40 KiB
secondary_calldata num: 10000 size: 312 KiB
secondary_calldata_read_counts num: 10000 size: 312 KiB
secondary_calldata_read_tags num: 10000 size: 312 KiB
secondary_calldata_inverses num: 1281 size: 40 KiB
return_data num: 10000 size: 312 KiB
return_data_read_counts num: 10000 size: 312 KiB
return_data_read_tags num: 10000 size: 312 KiB
return_data_inverses num: 1281 size: 40 KiB
table_1_shift num: 70000 size: 2187 KiB
table_2_shift num: 70000 size: 2187 KiB
table_3_shift num: 70000 size: 2187 KiB
table_4_shift num: 70000 size: 2187 KiB
w_l_shift num: 524287 size: 16383 KiB
w_r_shift num: 524287 size: 16383 KiB
w_o_shift num: 524287 size: 16383 KiB
w_4_shift num: 524287 size: 16383 KiB
z_perm_shift num: 524287 size: 16383 KiB
pk_mem/CLIENT_IVC_BENCH/iterations:1        313 ms          296 ms            1 builder_mem_est=402.467M poly_mem_est=378.719M

github-actions · 2024-11-10T02:23:59Z

Changes to circuit sizes

Generated at commit: d6a8c9affaaf8410112d0df14051ec463738f33d, compared to commit: 5246251a7a91874e2af5f8788aa29533c3cf77fe

🧾 Summary (100% most significant diffs)

Program	ACIR opcodes (+/-)	%	Circuit size (+/-)	%
rollup_block_root	-464 ✅	-10.35%	-5,960 ✅	-0.21%
rollup_block_root_empty	-32 ✅	-34.04%	-32 ✅	-1.11%
rollup_block_merge	-6,944 ✅	-57.86%	-29,107 ✅	-1.50%
rollup_root	-6,944 ✅	-57.94%	-29,107 ✅	-1.50%
rollup_base_public	-15,774 ✅	-3.36%	-221,618 ✅	-5.88%
rollup_base_private	-15,774 ✅	-4.74%	-221,618 ✅	-6.46%
private_kernel_reset_4_4_4_4_4_4_4_4_1	-667 ✅	-2.06%	-9,667 ✅	-11.16%
private_kernel_reset	-10,767 ✅	-12.80%	-155,407 ✅	-25.14%

Full diff report 👇

Program	ACIR opcodes (+/-)	%	Circuit size (+/-)	%
rollup_block_root	4,021 (-464)	-10.35%	2,857,288 (-5,960)	-0.21%
rollup_block_root_empty	62 (-32)	-34.04%	2,859 (-32)	-1.11%
rollup_block_merge	5,057 (-6,944)	-57.86%	1,911,809 (-29,107)	-1.50%
rollup_root	5,041 (-6,944)	-57.94%	1,911,795 (-29,107)	-1.50%
rollup_base_public	454,278 (-15,774)	-3.36%	3,549,032 (-221,618)	-5.88%
rollup_base_private	317,095 (-15,774)	-4.74%	3,210,904 (-221,618)	-6.46%
private_kernel_reset_4_4_4_4_4_4_4_4_1	31,789 (-667)	-2.06%	76,935 (-9,667)	-11.16%
private_kernel_reset	73,331 (-10,767)	-12.80%	462,711 (-155,407)	-25.14%

``` -------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------------- construct_pk/TraceStructure::E2E_FULL_TEST 1160 ms 1129 ms 1 pk mem=911.716M construct_pk/TraceStructure::CLIENT_IVC_BENCH 365 ms 349 ms 2 pk mem=471.561M ```

This reverts commit c564a56.

ledwards2225

Nice, couple of comments

ledwards2225 · 2024-11-14T17:13:59Z

barretenberg/cpp/src/barretenberg/benchmark/mega_memory_bench/mega_memory.bench.cpp

+    fill_lookup_block(builder);
+
+    {
+        // finalize doesn't populate public inputs block, so copy to verify that the block is being filled well.


yeah this needs to be robustified. probably makes sense to just have the builder do it rather than populating a separate public_inputs object. There are a number of things that are more awkward than they need to be because we maintain this somewhat artificial distinction between the builder and the prover (e.g. structured trace is fundamentally a 'prover' notion). Maybe we should just do away with this division when it makes sense

ledwards2225 · 2024-11-14T17:25:04Z

barretenberg/cpp/src/barretenberg/benchmark/mega_memory_bench/mega_memory.bench.cpp

+    Builder builder;
+    builder.blocks.set_fixed_block_sizes(settings);
+
+    fill_ecc_op_block(builder);


I'm not sure I see the need to populate these blocks with real data. In the structured case I think the actual block sizes are irrelevant because the fixed sizes are what determine memory. In the unstructured case it should only be about the size() of the blocks but even then it doesn't matter if we're storing zeros or something else. This would also make it easy to just populate the blocks with exactly the right size.

ledwards2225 · 2024-11-14T17:28:48Z

barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/mega_flavor.hpp

+            vinfo("++Estimating proving key memory++");
+            for (auto [polynomial, label] : zip_view(polynomials.get_all(), polynomials.get_labels())) {
+                uint64_t size = polynomial.size();
+                vinfo(label, " num: ", size, " size: ", (size * sizeof(FF)) >> 10, " KiB");


I know you're not counting shifts in the total memory but I was a bit mislead by this print at first because it makes it look like shifts are utilizing their own mem

ledwards2225 · 2024-11-14T17:38:21Z

barretenberg/cpp/src/barretenberg/benchmark/mega_memory_bench/mega_memory.bench.cpp

+        // alternative: add to finalize or add a flag to check whether PIs have already been populated
+        auto builder_copy = builder;
+        builder_copy.finalize_circuit(/* ensure_nonzero */ false);
+        DeciderProvingKey::Trace::populate_public_inputs_block(builder_copy);


One other thing. If you just construct a DeciderProvingKey from this builder_copy, it will now automatically compute the overflow / whether or not there is overflow. The print isn't as detailed as this one because I didn't add the labels (as you now have) but it could be made to be similar to what you have here (and probably should be for the sake of dev experience)

It would be better to actually use Google Bench's memory manager functionality and count allocations. We already have something similar implemented for Tracy. After striking out with that approach for a bit I reverted to just manually counting the size of the biggest vectors. The PR uncovered this issue: some trace structures have unusable capacity, not just due to using fewer than a dyadic number of gates, but also because of coupling of certain gate types #1149 See AztecProtocol/aztec-packages#9858 for logs of benchmarks.

codygunton self-assigned this Nov 8, 2024

codygunton force-pushed the cg/memory-bench branch from 8c4de71 to 6a13552 Compare November 14, 2024 14:28

codygunton changed the base branch from master to cg/revert-tbb-default-and-ivc-matching-regression-mac-build November 14, 2024 14:46

Base automatically changed from cg/revert-tbb-default-and-ivc-matching-regression-mac-build to master November 14, 2024 15:14

codygunton added 22 commits November 14, 2024 15:49

This works to fill all of the gates

8212f78

Use most tables

81ec7e8

OK filling verified for public inputs

272a6b3

Trying overloads but getting segfaults

f023c3f

Better info

83af88f

Estimate builder memory (iou variables)

44b5a56

Stop building st wasm in yarn build

629203b

Fix build by ignoring databus

03a3db4

Revert change in circuit hash (?!)

8a99eac

Track more of builder

8649ce8

Track PIs (in use?)

3fa411f

Custom allocator doesn't build in gcc

d5af273

Revert "Custom allocator doesn't build in gcc"

197d592

This reverts commit c564a56.

Some cleanup

d2a893c

Revert yarn changes

42cd9f6

More cleanup and verbose logging

82a1dd5

Rebase over Luke's structure changes

2f451b0

More cleanup

44f17c5

Remove (unneeded?) header

8d95637

Deleted unused table id PEDERSEN_IV

78afdcf

More cleanup

530660a

codygunton force-pushed the cg/memory-bench branch from 600073f to 530660a Compare November 14, 2024 15:49

codygunton added 2 commits November 14, 2024 15:57

Check prints and clarify summarization output

8e600e6

More cleanup in new file

e1daef7

codygunton requested a review from ledwards2225 November 14, 2024 17:08

ledwards2225 approved these changes Nov 14, 2024

View reviewed changes

ledwards2225 reviewed Nov 14, 2024

View reviewed changes

codygunton merged commit 7e587d6 into master Nov 14, 2024
46 checks passed

codygunton deleted the cg/memory-bench branch November 14, 2024 17:55

AztecBot mentioned this pull request Nov 14, 2024

chore(master): Release 0.63.0 #9651

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Mega memory benchmarks #9858

feat: Mega memory benchmarks #9858

codygunton commented Nov 8, 2024 •

edited

Loading

github-actions bot commented Nov 10, 2024

ledwards2225 left a comment

ledwards2225 Nov 14, 2024

ledwards2225 Nov 14, 2024

ledwards2225 Nov 14, 2024

ledwards2225 Nov 14, 2024

feat: Mega memory benchmarks #9858

feat: Mega memory benchmarks #9858

Conversation

codygunton commented Nov 8, 2024 • edited Loading

github-actions bot commented Nov 10, 2024

Changes to circuit sizes

🧾 Summary (100% most significant diffs)

ledwards2225 left a comment

Choose a reason for hiding this comment

ledwards2225 Nov 14, 2024

Choose a reason for hiding this comment

ledwards2225 Nov 14, 2024

Choose a reason for hiding this comment

ledwards2225 Nov 14, 2024

Choose a reason for hiding this comment

ledwards2225 Nov 14, 2024

Choose a reason for hiding this comment

codygunton commented Nov 8, 2024 •

edited

Loading