feat: SIMD implementation for as-sha256 #367

Merged
merged 7 commits into from
May 24, 2024



@twoeths twoeths commented Apr 15, 2024

Motivation

SIMD is available in AssemblyScript through the 128-bit v128 data type. A v128 holds four 32-bit lanes, which means we can hash 4 inputs in parallel.
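To illustrate why one v128 maps onto 4 inputs, here is a minimal TypeScript sketch of the lane layout. It is not the code in assembly/simd.ts; the function name interleave4 and the flat Uint32Array layout are assumptions made for illustration. Each group of 4 consecutive output words stands in for one v128 that holds the same message word from 4 different inputs, so a single vector operation advances all 4 hashes at once.

```typescript
// Hypothetical helper (not part of as-sha256): lay out 4 inputs of 64 bytes
// so that word w of every input sits side by side, like the 4 lanes of a v128.
function interleave4(inputs: Uint8Array[]): Uint32Array {
  if (inputs.length !== 4) {
    throw new Error("expected exactly 4 inputs");
  }
  // 16 message words per 64-byte block, 4 lanes per word.
  const out = new Uint32Array(16 * 4);
  for (let lane = 0; lane < 4; lane++) {
    const input = inputs[lane];
    if (input.length !== 64) {
      throw new Error("each input must be 64 bytes");
    }
    const view = new DataView(input.buffer, input.byteOffset, input.byteLength);
    for (let w = 0; w < 16; w++) {
      // SHA-256 reads message words as big-endian 32-bit integers.
      out[w * 4 + lane] = view.getUint32(w * 4, false);
    }
  }
  return out;
}

// Usage: 4 inputs, each filled with its own lane index.
const demoInputs = [0, 1, 2, 3].map((i) => new Uint8Array(64).fill(i));
const lanes = interleave4(demoInputs);
// lanes[0..3] now hold word 0 of inputs 0..3.
```

With this layout, the SHA-256 round function can be written once over vectors instead of being run 4 times over scalars.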

Description

  • New AssemblyScript SIMD implementation in assembly/simd.ts
  • New methods to support hashing 4 inputs (each 64 bytes) in parallel:
    • hash4Input64s(inputs: Uint8Array[]): Uint8Array[]
    • hash8HashObjects(inputs: HashObject[])
  • Add unit tests and benchmarks
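The two new methods cover the same amount of data: 8 hash objects of 32 bytes each make up 4 inputs of 64 bytes. The sketch below shows one way that grouping could look. HashObjectLike is a hypothetical stand-in for the library's HashObject, and both the pairing of adjacent objects and the little-endian byte order are assumptions, not confirmed details of this PR.

```typescript
// Hypothetical stand-in for HashObject: eight 32-bit words, 32 bytes in total.
interface HashObjectLike {
  h0: number; h1: number; h2: number; h3: number;
  h4: number; h5: number; h6: number; h7: number;
}

// Write one hash object into `out` at `offset`.
// Little-endian word order is an assumption made for this sketch.
function writeHashObject(obj: HashObjectLike, out: Uint8Array, offset: number): void {
  const view = new DataView(out.buffer, out.byteOffset, out.byteLength);
  const words = [obj.h0, obj.h1, obj.h2, obj.h3, obj.h4, obj.h5, obj.h6, obj.h7];
  for (let i = 0; i < 8; i++) {
    view.setUint32(offset + i * 4, words[i] >>> 0, true);
  }
}

// Group 8 hash objects into 4 inputs of 64 bytes (assumed pairing: 0+1, 2+3, 4+5, 6+7).
function pairHashObjects(objs: HashObjectLike[]): Uint8Array[] {
  if (objs.length !== 8) {
    throw new Error("expected exactly 8 hash objects");
  }
  const inputs: Uint8Array[] = [];
  for (let i = 0; i < 4; i++) {
    const input = new Uint8Array(64);
    writeHashObject(objs[2 * i], input, 0);
    writeHashObject(objs[2 * i + 1], input, 32);
    inputs.push(input);
  }
  return inputs;
}

// Usage: 8 objects whose first word is their index.
const demoObjs: HashObjectLike[] = Array.from({length: 8}, (_, i) => ({
  h0: i, h1: 0, h2: 0, h3: 0, h4: 0, h5: 0, h6: 0, h7: 0,
}));
const paired = pairHashObjects(demoObjs);
```

Each of the 4 resulting buffers has the shape that hash4Input64s expects as one input.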

Closes #356


github-actions bot commented Apr 15, 2024

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: 8133481 Previous: cf8f049 Ratio
digestTwoHashObjects 50023 times 47.923 ms/op 47.926 ms/op 1.00
digest64 50023 times 50.469 ms/op 50.930 ms/op 0.99
digest 50023 times 52.153 ms/op 52.992 ms/op 0.98
input length 32 1.2030 us/op 1.1920 us/op 1.01
input length 64 1.3590 us/op 1.3970 us/op 0.97
input length 128 2.2660 us/op 2.3880 us/op 0.95
input length 256 3.3830 us/op 3.4430 us/op 0.98
input length 512 5.5630 us/op 5.6190 us/op 0.99
input length 1024 10.707 us/op 10.763 us/op 0.99
digest 1000000 times 824.19 ms/op 837.14 ms/op 0.98
hashObjectToByteArray 50023 times 1.4283 ms/op 1.4692 ms/op 0.97
byteArrayToHashObject 50023 times 2.4242 ms/op 2.4603 ms/op 0.99
digest64 200092 times 206.57 ms/op
hash 200092 times using batchHash4UintArray64s 212.05 ms/op
hash 200092 times using batchHash4HashObjectInputs 212.59 ms/op
getGindicesAtDepth 4.6080 us/op 4.6690 us/op 0.99
iterateAtDepth 7.2810 us/op 7.4530 us/op 0.98
getGindexBits 428.00 ns/op 430.00 ns/op 1.00
gindexIterator 1.0290 us/op 972.00 ns/op 1.06
hash 2 Uint8Array 2250026 times - as-sha256 2.3156 s/op 2.3533 s/op 0.98
hashTwoObjects 2250026 times - as-sha256 2.1663 s/op 2.2222 s/op 0.97
hash 2 Uint8Array 2250026 times - noble 5.0159 s/op 5.2452 s/op 0.96
hashTwoObjects 2250026 times - noble 6.8932 s/op 6.8410 s/op 1.01
getNodeH() x7812.5 avg hindex 12.143 us/op 12.969 us/op 0.94
getNodeH() x7812.5 index 0 6.3680 us/op 6.6040 us/op 0.96
getNodeH() x7812.5 index 7 6.4100 us/op 6.5780 us/op 0.97
getNodeH() x7812.5 index 7 with key array 6.3800 us/op 6.4950 us/op 0.98
new LeafNode() x7812.5 14.760 us/op 15.032 us/op 0.98
multiproof - depth 15, 1 requested leaves 8.6070 us/op 9.6410 us/op 0.89
tree offset multiproof - depth 15, 1 requested leaves 19.633 us/op 20.563 us/op 0.95
compact multiproof - depth 15, 1 requested leaves 3.7230 us/op 5.4290 us/op 0.69
multiproof - depth 15, 2 requested leaves 11.534 us/op 12.903 us/op 0.89
tree offset multiproof - depth 15, 2 requested leaves 21.439 us/op 23.655 us/op 0.91
compact multiproof - depth 15, 2 requested leaves 3.4330 us/op 4.4640 us/op 0.77
multiproof - depth 15, 3 requested leaves 16.153 us/op 18.176 us/op 0.89
tree offset multiproof - depth 15, 3 requested leaves 27.953 us/op 29.919 us/op 0.93
compact multiproof - depth 15, 3 requested leaves 4.1860 us/op 6.4790 us/op 0.65
multiproof - depth 15, 4 requested leaves 21.466 us/op 23.370 us/op 0.92
tree offset multiproof - depth 15, 4 requested leaves 33.883 us/op 36.995 us/op 0.92
compact multiproof - depth 15, 4 requested leaves 5.0580 us/op 5.3080 us/op 0.95
packedRootsBytesToLeafNodes bytes 4000 offset 0 1.9560 us/op 1.9930 us/op 0.98
packedRootsBytesToLeafNodes bytes 4000 offset 1 1.9810 us/op 2.0020 us/op 0.99
packedRootsBytesToLeafNodes bytes 4000 offset 2 1.9630 us/op 2.0000 us/op 0.98
packedRootsBytesToLeafNodes bytes 4000 offset 3 1.8760 us/op 1.9940 us/op 0.94
subtreeFillToContents depth 40 count 250000 46.530 ms/op 45.958 ms/op 1.01
setRoot - gindexBitstring 8.1636 ms/op 8.4206 ms/op 0.97
setRoot - gindex 8.5065 ms/op 8.7619 ms/op 0.97
getRoot - gindexBitstring 2.4350 ms/op 2.4504 ms/op 0.99
getRoot - gindex 3.3562 ms/op 3.3620 ms/op 1.00
getHashObject then setHashObject 10.247 ms/op 10.481 ms/op 0.98
setNodeWithFn 7.9182 ms/op 8.0530 ms/op 0.98
getNodeAtDepth depth 0 x100000 1.0832 ms/op 1.0852 ms/op 1.00
setNodeAtDepth depth 0 x100000 2.3466 ms/op 2.4234 ms/op 0.97
getNodesAtDepth depth 0 x100000 1.0524 ms/op 1.0538 ms/op 1.00
setNodesAtDepth depth 0 x100000 1.4245 ms/op 1.4528 ms/op 0.98
getNodeAtDepth depth 1 x100000 1.1464 ms/op 1.1686 ms/op 0.98
setNodeAtDepth depth 1 x100000 5.1183 ms/op 5.1398 ms/op 1.00
getNodesAtDepth depth 1 x100000 1.1763 ms/op 1.1909 ms/op 0.99
setNodesAtDepth depth 1 x100000 4.3033 ms/op 4.3132 ms/op 1.00
getNodeAtDepth depth 2 x100000 1.4276 ms/op 1.4221 ms/op 1.00
setNodeAtDepth depth 2 x100000 8.7806 ms/op 10.417 ms/op 0.84
getNodesAtDepth depth 2 x100000 16.869 ms/op 18.389 ms/op 0.92
setNodesAtDepth depth 2 x100000 12.381 ms/op 12.926 ms/op 0.96
tree.getNodesAtDepth - gindexes 7.7827 ms/op 8.0320 ms/op 0.97
tree.getNodesAtDepth - push all nodes 1.9585 ms/op 1.9345 ms/op 1.01
tree.getNodesAtDepth - navigation 233.92 us/op 235.57 us/op 0.99
tree.setNodesAtDepth - indexes 349.98 us/op 308.89 us/op 1.13
set at depth 8 443.00 ns/op 450.00 ns/op 0.98
set at depth 16 588.00 ns/op 596.00 ns/op 0.99
set at depth 32 951.00 ns/op 958.00 ns/op 0.99
iterateNodesAtDepth 8 256 13.080 us/op 13.212 us/op 0.99
getNodesAtDepth 8 256 3.4390 us/op 3.3790 us/op 1.02
iterateNodesAtDepth 16 65536 4.2388 ms/op 4.3308 ms/op 0.98
getNodesAtDepth 16 65536 1.5835 ms/op 1.6273 ms/op 0.97
iterateNodesAtDepth 32 250000 15.410 ms/op 15.634 ms/op 0.99
getNodesAtDepth 32 250000 4.3000 ms/op 4.3522 ms/op 0.99
iterateNodesAtDepth 40 250000 15.540 ms/op 15.708 ms/op 0.99
getNodesAtDepth 40 250000 4.3836 ms/op 4.4330 ms/op 0.99
250k validators 7.1398 s/op 7.1114 s/op 1.00
bitlist bytes to struct (120,90) 482.00 ns/op 484.00 ns/op 1.00
bitlist bytes to tree (120,90) 2.1360 us/op 2.1460 us/op 1.00
bitlist bytes to struct (2048,2048) 911.00 ns/op 922.00 ns/op 0.99
bitlist bytes to tree (2048,2048) 3.3240 us/op 3.3630 us/op 0.99
ByteListType - deserialize 7.8165 ms/op 7.3046 ms/op 1.07
BasicListType - deserialize 11.857 ms/op 11.915 ms/op 1.00
ByteListType - serialize 7.8777 ms/op 7.9004 ms/op 1.00
BasicListType - serialize 9.6364 ms/op 10.023 ms/op 0.96
BasicListType - tree_convertToStruct 22.355 ms/op 22.655 ms/op 0.99
List[uint8, 68719476736] len 300000 ViewDU.getAll() + iterate 4.3003 ms/op 4.4147 ms/op 0.97
List[uint8, 68719476736] len 300000 ViewDU.get(i) 4.1212 ms/op 2.9512 ms/op 1.40
Array.push len 300000 empty Array - number 6.3746 ms/op 6.2896 ms/op 1.01
Array.set len 300000 from new Array - number 1.6630 ms/op 1.7071 ms/op 0.97
Array.set len 300000 - number 5.2218 ms/op 5.2257 ms/op 1.00
Uint8Array.set len 300000 373.14 us/op 372.38 us/op 1.00
Uint32Array.set len 300000 443.43 us/op 445.15 us/op 1.00
Container({a: uint8, b: uint8}) getViewDU x300000 52.403 ms/op 49.804 ms/op 1.05
ContainerNodeStruct({a: uint8, b: uint8}) getViewDU x300000 10.700 ms/op 10.834 ms/op 0.99
List(Container) len 300000 ViewDU.getAllReadonly() + iterate 208.75 ms/op 209.73 ms/op 1.00
List(Container) len 300000 ViewDU.getAllReadonlyValues() + iterate 316.36 ms/op 273.31 ms/op 1.16
List(Container) len 300000 ViewDU.get(i) 8.7640 ms/op 6.3717 ms/op 1.38
List(Container) len 300000 ViewDU.getReadonly(i) 8.1774 ms/op 6.3376 ms/op 1.29
List(ContainerNodeStruct) len 300000 ViewDU.getAllReadonly() + iterate 40.470 ms/op 41.496 ms/op 0.98
List(ContainerNodeStruct) len 300000 ViewDU.getAllReadonlyValues() + iterate 5.6273 ms/op 5.1590 ms/op 1.09
List(ContainerNodeStruct) len 300000 ViewDU.get(i) 7.2073 ms/op 5.9948 ms/op 1.20
List(ContainerNodeStruct) len 300000 ViewDU.getReadonly(i) 7.1238 ms/op 5.9572 ms/op 1.20
Array.push len 300000 empty Array - object 6.8128 ms/op 5.9218 ms/op 1.15
Array.set len 300000 from new Array - object 2.2630 ms/op 1.9831 ms/op 1.14
Array.set len 300000 - object 6.7586 ms/op 5.7016 ms/op 1.19
cachePermanentRootStruct no cache 9.2840 us/op 8.5850 us/op 1.08
cachePermanentRootStruct with cache 237.00 ns/op 188.00 ns/op 1.26
epochParticipation len 250000 rws 7813 2.3041 ms/op 1.8994 ms/op 1.21
deserialize Attestation - tree 4.5990 us/op 4.0490 us/op 1.14
deserialize Attestation - struct 2.0270 us/op 1.7750 us/op 1.14
deserialize SignedAggregateAndProof - tree 3.7370 us/op 3.6180 us/op 1.03
deserialize SignedAggregateAndProof - struct 3.1580 us/op 2.9150 us/op 1.08
deserialize SyncCommitteeMessage - tree 1.0770 us/op 1.0360 us/op 1.04
deserialize SyncCommitteeMessage - struct 1.1750 us/op 980.00 ns/op 1.20
deserialize SignedContributionAndProof - tree 2.1180 us/op 1.9690 us/op 1.08
deserialize SignedContributionAndProof - struct 2.5370 us/op 2.3590 us/op 1.08
deserialize SignedBeaconBlock - tree 238.34 us/op 208.32 us/op 1.14
deserialize SignedBeaconBlock - struct 126.23 us/op 120.84 us/op 1.04
BeaconState vc 300000 - deserialize tree 598.10 ms/op 593.02 ms/op 1.01
BeaconState vc 300000 - serialize tree 147.94 ms/op 148.19 ms/op 1.00
BeaconState.historicalRoots vc 300000 - deserialize tree 876.00 ns/op 821.00 ns/op 1.07
BeaconState.historicalRoots vc 300000 - serialize tree 800.00 ns/op 765.00 ns/op 1.05
BeaconState.validators vc 300000 - deserialize tree 550.23 ms/op 521.80 ms/op 1.05
BeaconState.validators vc 300000 - serialize tree 98.321 ms/op 102.19 ms/op 0.96
BeaconState.balances vc 300000 - deserialize tree 20.496 ms/op 20.686 ms/op 0.99
BeaconState.balances vc 300000 - serialize tree 4.0125 ms/op 3.9926 ms/op 1.00
BeaconState.previousEpochParticipation vc 300000 - deserialize tree 548.56 us/op 684.49 us/op 0.80
BeaconState.previousEpochParticipation vc 300000 - serialize tree 291.01 us/op 288.96 us/op 1.01
BeaconState.currentEpochParticipation vc 300000 - deserialize tree 563.17 us/op 450.13 us/op 1.25
BeaconState.currentEpochParticipation vc 300000 - serialize tree 283.88 us/op 287.17 us/op 0.99
BeaconState.inactivityScores vc 300000 - deserialize tree 21.006 ms/op 20.081 ms/op 1.05
BeaconState.inactivityScores vc 300000 - serialize tree 4.1597 ms/op 3.6692 ms/op 1.13
hashTreeRoot Attestation - struct 33.643 us/op 27.463 us/op 1.23
hashTreeRoot Attestation - tree 21.286 us/op 18.111 us/op 1.18
hashTreeRoot SignedAggregateAndProof - struct 57.859 us/op 37.426 us/op 1.55
hashTreeRoot SignedAggregateAndProof - tree 29.846 us/op 27.126 us/op 1.10
hashTreeRoot SyncCommitteeMessage - struct 10.282 us/op 8.9650 us/op 1.15
hashTreeRoot SyncCommitteeMessage - tree 6.6760 us/op 6.3710 us/op 1.05
hashTreeRoot SignedContributionAndProof - struct 26.790 us/op 24.215 us/op 1.11
hashTreeRoot SignedContributionAndProof - tree 20.062 us/op 19.253 us/op 1.04
hashTreeRoot SignedBeaconBlock - struct 2.5356 ms/op 2.1739 ms/op 1.17
hashTreeRoot SignedBeaconBlock - tree 1.7796 ms/op 1.6946 ms/op 1.05
hashTreeRoot Validator - struct 12.951 us/op 12.096 us/op 1.07
hashTreeRoot Validator - tree 11.074 us/op 10.355 us/op 1.07
BeaconState vc 300000 - hashTreeRoot tree 3.6886 s/op 3.6525 s/op 1.01
BeaconState.historicalRoots vc 300000 - hashTreeRoot tree 1.3500 us/op 1.3400 us/op 1.01
BeaconState.validators vc 300000 - hashTreeRoot tree 3.4979 s/op 3.4974 s/op 1.00
BeaconState.balances vc 300000 - hashTreeRoot tree 86.933 ms/op 86.452 ms/op 1.01
BeaconState.previousEpochParticipation vc 300000 - hashTreeRoot tree 9.0174 ms/op 9.0131 ms/op 1.00
BeaconState.currentEpochParticipation vc 300000 - hashTreeRoot tree 9.0452 ms/op 9.0085 ms/op 1.00
BeaconState.inactivityScores vc 300000 - hashTreeRoot tree 88.884 ms/op 86.569 ms/op 1.03
hash64 x18 19.557 us/op 19.358 us/op 1.01
hashTwoObjects x18 18.413 us/op 17.861 us/op 1.03
hash64 x1740 1.8220 ms/op 1.8124 ms/op 1.01
hashTwoObjects x1740 1.7030 ms/op 1.7224 ms/op 0.99
hash64 x2700000 2.8527 s/op 2.8213 s/op 1.01
hashTwoObjects x2700000 2.6502 s/op 2.6376 s/op 1.00
get_exitEpoch - ContainerType 226.00 ns/op 190.00 ns/op 1.19
get_exitEpoch - ContainerNodeStructType 231.00 ns/op 190.00 ns/op 1.22
set_exitEpoch - ContainerType 239.00 ns/op 254.00 ns/op 0.94
set_exitEpoch - ContainerNodeStructType 237.00 ns/op 204.00 ns/op 1.16
get_pubkey - ContainerType 894.00 ns/op 854.00 ns/op 1.05
get_pubkey - ContainerNodeStructType 233.00 ns/op 201.00 ns/op 1.16
hashTreeRoot - ContainerType 371.00 ns/op 337.00 ns/op 1.10
hashTreeRoot - ContainerNodeStructType 446.00 ns/op 378.00 ns/op 1.18
createProof - ContainerType 4.2990 us/op 3.7110 us/op 1.16
createProof - ContainerNodeStructType 21.894 us/op 19.853 us/op 1.10
serialize - ContainerType 1.8750 us/op 1.7860 us/op 1.05
serialize - ContainerNodeStructType 1.5420 us/op 1.5830 us/op 0.97
set_exitEpoch_and_hashTreeRoot - ContainerType 4.2740 us/op 4.1860 us/op 1.02
set_exitEpoch_and_hashTreeRoot - ContainerNodeStructType 11.401 us/op 11.102 us/op 1.03
Array - for of 5.5600 us/op 5.6380 us/op 0.99
Array - for(;;) 5.5480 us/op 5.4620 us/op 1.02
basicListValue.readonlyValuesArray() 4.3692 ms/op 4.2076 ms/op 1.04
basicListValue.readonlyValuesArray() + loop all 5.2851 ms/op 4.1542 ms/op 1.27
compositeListValue.readonlyValuesArray() 29.942 ms/op 27.561 ms/op 1.09
compositeListValue.readonlyValuesArray() + loop all 29.698 ms/op 29.214 ms/op 1.02
Number64UintType - get balances list 4.2828 ms/op 4.3291 ms/op 0.99
Number64UintType - set balances list 9.5034 ms/op 10.021 ms/op 0.95
Number64UintType - get and increase 10 then set 39.115 ms/op 40.389 ms/op 0.97
Number64UintType - increase 10 using applyDelta 15.591 ms/op 17.193 ms/op 0.91
Number64UintType - increase 10 using applyDeltaInBatch 15.269 ms/op 17.224 ms/op 0.89
tree_newTreeFromUint64Deltas 16.533 ms/op 13.377 ms/op 1.24
unsafeUint8ArrayToTree 29.468 ms/op 26.745 ms/op 1.10
bitLength(50) 216.00 ns/op 203.00 ns/op 1.06
bitLengthStr(50) 209.00 ns/op 193.00 ns/op 1.08
bitLength(8000) 201.00 ns/op 197.00 ns/op 1.02
bitLengthStr(8000) 255.00 ns/op 245.00 ns/op 1.04
bitLength(250000) 223.00 ns/op 208.00 ns/op 1.07
bitLengthStr(250000) 314.00 ns/op 297.00 ns/op 1.06
floor - Math.floor (53) 1.2371 ns/op 1.2564 ns/op 0.98
floor - << 0 (53) 1.2366 ns/op 1.2374 ns/op 1.00
floor - Math.floor (512) 1.2370 ns/op 1.2365 ns/op 1.00
floor - << 0 (512) 1.2553 ns/op 1.2364 ns/op 1.02
fnIf(0) 1.5527 ns/op 1.5548 ns/op 1.00
fnSwitch(0) 2.1715 ns/op 2.1661 ns/op 1.00
fnObj(0) 1.5467 ns/op 1.5695 ns/op 0.99
fnArr(0) 1.5472 ns/op 1.5471 ns/op 1.00
fnIf(4) 2.1654 ns/op 2.1932 ns/op 0.99
fnSwitch(4) 2.1660 ns/op 2.1642 ns/op 1.00
fnObj(4) 1.5546 ns/op 1.5485 ns/op 1.00
fnArr(4) 1.5475 ns/op 1.5481 ns/op 1.00
fnIf(9) 3.1564 ns/op 3.0949 ns/op 1.02
fnSwitch(9) 2.1665 ns/op 2.1954 ns/op 0.99
fnObj(9) 1.5461 ns/op 1.5493 ns/op 1.00
fnArr(9) 1.5531 ns/op 1.5497 ns/op 1.00
Container {a,b,vec} - as struct x100000 124.07 us/op 123.91 us/op 1.00
Container {a,b,vec} - as tree x100000 340.37 us/op 340.30 us/op 1.00
Container {a,vec,b} - as struct x100000 157.79 us/op 154.77 us/op 1.02
Container {a,vec,b} - as tree x100000 371.42 us/op 372.12 us/op 1.00
get 2 props x1000000 - rawObject 309.44 us/op 310.81 us/op 1.00
get 2 props x1000000 - proxy 73.948 ms/op 72.741 ms/op 1.02
get 2 props x1000000 - customObj 309.77 us/op 309.33 us/op 1.00
Simple object binary -> struct 861.00 ns/op 795.00 ns/op 1.08
Simple object binary -> tree_backed 1.6640 us/op 1.5580 us/op 1.07
Simple object struct -> tree_backed 2.3310 us/op 2.1900 us/op 1.06
Simple object tree_backed -> struct 2.2450 us/op 2.1540 us/op 1.04
Simple object struct -> binary 1.0160 us/op 1.0830 us/op 0.94
Simple object tree_backed -> binary 1.5700 us/op 1.5820 us/op 0.99
aggregationBits binary -> struct 627.00 ns/op 589.00 ns/op 1.06
aggregationBits binary -> tree_backed 2.4090 us/op 2.3670 us/op 1.02
aggregationBits struct -> tree_backed 2.8380 us/op 2.8010 us/op 1.01
aggregationBits tree_backed -> struct 1.2140 us/op 1.1880 us/op 1.02
aggregationBits struct -> binary 797.00 ns/op 774.00 ns/op 1.03
aggregationBits tree_backed -> binary 1.0750 us/op 1.0300 us/op 1.04
List(uint8) 100000 binary -> struct 1.3397 ms/op 1.4490 ms/op 0.92
List(uint8) 100000 binary -> tree_backed 93.770 us/op 88.515 us/op 1.06
List(uint8) 100000 struct -> tree_backed 1.1678 ms/op 1.1905 ms/op 0.98
List(uint8) 100000 tree_backed -> struct 1.0327 ms/op 1.0591 ms/op 0.98
List(uint8) 100000 struct -> binary 988.12 us/op 1.0094 ms/op 0.98
List(uint8) 100000 tree_backed -> binary 88.551 us/op 87.930 us/op 1.01
List(uint64Number) 100000 binary -> struct 1.2350 ms/op 1.2081 ms/op 1.02
List(uint64Number) 100000 binary -> tree_backed 2.8315 ms/op 3.2269 ms/op 0.88
List(uint64Number) 100000 struct -> tree_backed 3.9792 ms/op 4.8569 ms/op 0.82
List(uint64Number) 100000 tree_backed -> struct 2.0545 ms/op 2.3570 ms/op 0.87
List(uint64Number) 100000 struct -> binary 1.3642 ms/op 1.5680 ms/op 0.87
List(uint64Number) 100000 tree_backed -> binary 810.64 us/op 905.40 us/op 0.90
List(Uint64Bigint) 100000 binary -> struct 3.5439 ms/op 3.6912 ms/op 0.96
List(Uint64Bigint) 100000 binary -> tree_backed 3.2928 ms/op 3.3661 ms/op 0.98
List(Uint64Bigint) 100000 struct -> tree_backed 5.2914 ms/op 5.5335 ms/op 0.96
List(Uint64Bigint) 100000 tree_backed -> struct 4.5456 ms/op 4.6956 ms/op 0.97
List(Uint64Bigint) 100000 struct -> binary 2.0308 ms/op 2.0423 ms/op 0.99
List(Uint64Bigint) 100000 tree_backed -> binary 982.22 us/op 1.1645 ms/op 0.84
Vector(Root) 100000 binary -> struct 28.981 ms/op 31.484 ms/op 0.92
Vector(Root) 100000 binary -> tree_backed 32.772 ms/op 33.719 ms/op 0.97
Vector(Root) 100000 struct -> tree_backed 37.789 ms/op 37.528 ms/op 1.01
Vector(Root) 100000 tree_backed -> struct 44.906 ms/op 45.449 ms/op 0.99
Vector(Root) 100000 struct -> binary 2.6262 ms/op 2.5929 ms/op 1.01
Vector(Root) 100000 tree_backed -> binary 9.5413 ms/op 10.302 ms/op 0.93
List(Validator) 100000 binary -> struct 105.60 ms/op 108.18 ms/op 0.98
List(Validator) 100000 binary -> tree_backed 288.03 ms/op 290.31 ms/op 0.99
List(Validator) 100000 struct -> tree_backed 295.83 ms/op 302.03 ms/op 0.98
List(Validator) 100000 tree_backed -> struct 190.95 ms/op 192.89 ms/op 0.99
List(Validator) 100000 struct -> binary 26.600 ms/op 27.086 ms/op 0.98
List(Validator) 100000 tree_backed -> binary 101.26 ms/op 101.01 ms/op 1.00
List(Validator-NS) 100000 binary -> struct 98.635 ms/op 105.24 ms/op 0.94
List(Validator-NS) 100000 binary -> tree_backed 146.63 ms/op 144.50 ms/op 1.01
List(Validator-NS) 100000 struct -> tree_backed 173.36 ms/op 173.97 ms/op 1.00
List(Validator-NS) 100000 tree_backed -> struct 144.68 ms/op 146.22 ms/op 0.99
List(Validator-NS) 100000 struct -> binary 26.798 ms/op 27.026 ms/op 0.99
List(Validator-NS) 100000 tree_backed -> binary 33.001 ms/op 32.982 ms/op 1.00
get epochStatuses - MutableVector 90.933 us/op 104.84 us/op 0.87
get epochStatuses - ViewDU 208.96 us/op 208.53 us/op 1.00
set epochStatuses - ListTreeView 1.4093 ms/op 1.6046 ms/op 0.88
set epochStatuses - ListTreeView - set() 440.21 us/op 457.65 us/op 0.96
set epochStatuses - ListTreeView - commit() 446.39 us/op 438.80 us/op 1.02
bitstring 641.44 ns/op 645.17 ns/op 0.99
bit mask 13.464 ns/op 14.232 ns/op 0.95
struct - increase slot to 1000000 928.47 us/op 927.45 us/op 1.00
UintNumberType - increase slot to 1000000 21.668 ms/op 23.901 ms/op 0.91
UintBigintType - increase slot to 1000000 166.59 ms/op 200.68 ms/op 0.83
UintBigint8 x 100000 tree_deserialize 4.5355 ms/op 5.2920 ms/op 0.86
UintBigint8 x 100000 tree_serialize 1.0914 ms/op 1.0923 ms/op 1.00
UintBigint16 x 100000 tree_deserialize 4.5547 ms/op 6.1811 ms/op 0.74
UintBigint16 x 100000 tree_serialize 1.1746 ms/op 1.5894 ms/op 0.74
UintBigint32 x 100000 tree_deserialize 4.7314 ms/op 5.8123 ms/op 0.81
UintBigint32 x 100000 tree_serialize 1.1852 ms/op 1.4116 ms/op 0.84
UintBigint64 x 100000 tree_deserialize 4.9360 ms/op 6.5494 ms/op 0.75
UintBigint64 x 100000 tree_serialize 1.5536 ms/op 1.9879 ms/op 0.78
UintBigint8 x 100000 value_deserialize 432.91 us/op 432.99 us/op 1.00
UintBigint8 x 100000 value_serialize 623.87 us/op 708.83 us/op 0.88
UintBigint16 x 100000 value_deserialize 466.47 us/op 464.54 us/op 1.00
UintBigint16 x 100000 value_serialize 709.62 us/op 788.61 us/op 0.90
UintBigint32 x 100000 value_deserialize 433.18 us/op 433.86 us/op 1.00
UintBigint32 x 100000 value_serialize 660.54 us/op 786.64 us/op 0.84
UintBigint64 x 100000 value_deserialize 495.88 us/op 510.50 us/op 0.97
UintBigint64 x 100000 value_serialize 850.03 us/op 1.0409 ms/op 0.82
UintBigint8 x 100000 deserialize 2.8597 ms/op 3.6057 ms/op 0.79
UintBigint8 x 100000 serialize 1.4574 ms/op 1.6029 ms/op 0.91
UintBigint16 x 100000 deserialize 2.8137 ms/op 3.1933 ms/op 0.88
UintBigint16 x 100000 serialize 1.4876 ms/op 1.5637 ms/op 0.95
UintBigint32 x 100000 deserialize 2.7950 ms/op 3.2083 ms/op 0.87
UintBigint32 x 100000 serialize 2.7531 ms/op 2.9506 ms/op 0.93
UintBigint64 x 100000 deserialize 3.7903 ms/op 3.8717 ms/op 0.98
UintBigint64 x 100000 serialize 1.5308 ms/op 1.5096 ms/op 1.01
UintBigint128 x 100000 deserialize 5.4717 ms/op 5.0612 ms/op 1.08
UintBigint128 x 100000 serialize 14.511 ms/op 14.205 ms/op 1.02
UintBigint256 x 100000 deserialize 7.7624 ms/op 8.0662 ms/op 0.96
UintBigint256 x 100000 serialize 42.970 ms/op 42.049 ms/op 1.02
Slice from Uint8Array x25000 1.1213 ms/op 1.1554 ms/op 0.97
Slice from ArrayBuffer x25000 16.798 ms/op 16.639 ms/op 1.01
Slice from ArrayBuffer x25000 + new Uint8Array 18.801 ms/op 18.124 ms/op 1.04
Copy Uint8Array 100000 iterate 1.6477 ms/op 1.6601 ms/op 0.99
Copy Uint8Array 100000 slice 104.80 us/op 130.82 us/op 0.80
Copy Uint8Array 100000 Uint8Array.prototype.slice.call 110.86 us/op 137.70 us/op 0.81
Copy Buffer 100000 Uint8Array.prototype.slice.call 110.70 us/op 130.41 us/op 0.85
Copy Uint8Array 100000 slice + set 176.37 us/op 238.49 us/op 0.74
Copy Uint8Array 100000 subarray + set 112.81 us/op 127.50 us/op 0.88
Copy Uint8Array 100000 slice arrayBuffer 116.61 us/op 130.35 us/op 0.89
Uint64 deserialize 100000 - iterate Uint8Array 1.7804 ms/op 1.8916 ms/op 0.94
Uint64 deserialize 100000 - by Uint32A 1.8257 ms/op 1.9184 ms/op 0.95
Uint64 deserialize 100000 - by DataView.getUint32 x2 1.8503 ms/op 1.9187 ms/op 0.96
Uint64 deserialize 100000 - by DataView.getBigUint64 5.0285 ms/op 5.0542 ms/op 0.99
Uint64 deserialize 100000 - by byte 40.106 ms/op 40.585 ms/op 0.99

by benchmarkbot/action


twoeths commented Apr 19, 2024

The performance of the SIMD implementation depends heavily on the CPU. Below is a comparison of SIMD vs digest64:

  • in CI (ubuntu), SIMD is only slightly faster
  • in my environment (Mac M1), SIMD is ~20% faster:
  digest64 vs hash4Input64s vs hash8HashObjects
    ✓ digest64 200092 times                                               6.206878 ops/s    161.1116 ms/op        -         60 runs   10.3 s
    ✓ hash 200092 times using hash4Input64s                               7.460423 ops/s    134.0406 ms/op        -         72 runs   10.2 s
    ✓ hash 200092 times using hash8HashObjects                            7.834839 ops/s    127.6350 ms/op        -         76 runs   10.2 s
  • on another ubuntu server (used for running a Lodestar beacon node), SIMD is almost 2x faster:
  digest64 vs hash4Input64s vs hash8HashObjects
    ✓ digest64 200092 times                                               4.908615 ops/s    203.7235 ms/op        -         47 runs   10.2 s
    ✓ hash 200092 times using hash4Input64s                               9.644699 ops/s    103.6839 ms/op        -         94 runs   10.3 s
    ✓ hash 200092 times using hash8HashObjects                            9.390349 ops/s    106.4923 ms/op        -         90 runs   10.1 s
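
The "~20%" and "almost 2x" figures follow directly from the ms/op columns above. As a quick check, the sketch below divides the digest64 time by the hash4Input64s time for each machine; the helper name speedup is introduced here for illustration only.

```typescript
// Speedup = baseline time / candidate time (both in ms/op, taken from the runs above).
function speedup(baselineMsPerOp: number, candidateMsPerOp: number): number {
  return baselineMsPerOp / candidateMsPerOp;
}

// Mac M1: digest64 vs hash4Input64s
const m1Speedup = speedup(161.1116, 134.0406);
// Ubuntu server: digest64 vs hash4Input64s
const serverSpeedup = speedup(203.7235, 103.6839);

console.log(m1Speedup.toFixed(2)); // prints "1.20"
console.log(serverSpeedup.toFixed(2)); // prints "1.96"
```

A 1.20x speedup corresponds to roughly 17% less time per operation, which is why the gain can be quoted either way depending on whether ops/s or ms/op is compared.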

@twoeths twoeths marked this pull request as ready for review April 19, 2024 03:22
@twoeths twoeths requested a review from a team as a code owner April 19, 2024 03:22
g11tech previously approved these changes Apr 23, 2024

@g11tech g11tech left a comment


lgtm

@twoeths twoeths enabled auto-merge (squash) April 28, 2024 00:12
@twoeths twoeths disabled auto-merge April 28, 2024 00:12
Review thread on packages/as-sha256/assembly/utils/v128.ts (outdated, resolved)
Review thread on packages/as-sha256/src/index.ts (outdated, resolved)
wemeetagain previously approved these changes May 22, 2024
@twoeths twoeths merged commit ec123ec into master May 24, 2024
8 checks passed
@twoeths twoeths deleted the tuyen/digest_64_simd_2 branch May 24, 2024 01:19
Development

Successfully merging this pull request may close these issues.

SIMD implementation for as-sha256
3 participants