Use CPU copy with SharedStorage #445

Merged

merged 2 commits into main on Oct 8, 2024
Conversation

@christiangnrd (Contributor) commented on Oct 2, 2024

Use CPU copy for shared storage arrays to avoid ObjectiveC.jl overhead.

Is this even a good idea?

Depends on #452
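The optimization can be sketched as follows. This is a minimal illustration of the idea, not Metal.jl's actual implementation; `is_shared`, `hostpointer`, and `gpu_copyto!` are hypothetical stand-ins for the real storage-mode check, buffer pointer accessor, and blit-based fallback:

```julia
# Sketch: when a GPU array uses shared (unified) storage, its buffer is
# directly addressable from the CPU, so a plain host-side memory copy
# avoids the ObjectiveC.jl call overhead of encoding a blit command.
# `is_shared`, `hostpointer`, and `gpu_copyto!` are hypothetical names.

function fast_copyto!(dest::Vector{T}, src) where {T}
    if is_shared(src)   # storage mode is Shared: buffer is CPU-visible
        GC.@preserve dest src begin
            unsafe_copyto!(pointer(dest), hostpointer(src), length(src))
        end
    else                # private storage: fall back to a GPU blit copy
        gpu_copyto!(dest, src)
    end
    return dest
end
```

Under this assumption, only the fast path changes behavior: copies involving private storage still go through the command encoder, while shared-storage copies become a single `unsafe_copyto!` on host pointers.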

@github-actions bot left a comment
Metal Benchmarks

| Benchmark suite | Current: e9ac0d2 | Previous: ff7c7eb | Ratio |
|---|---|---|---|
| private array/construct | 27208.333333333332 ns | 26687.5 ns | 1.02 |
| private array/broadcast | 455584 ns | 465979.5 ns | 0.98 |
| private array/random/randn/Float32 | 1011500 ns | 993270.5 ns | 1.02 |
| private array/random/randn!/Float32 | 631583 ns | 632166.5 ns | 1.00 |
| private array/random/rand!/Int64 | 577417 ns | 568500 ns | 1.02 |
| private array/random/rand!/Float32 | 586000 ns | 583500 ns | 1.00 |
| private array/random/rand/Int64 | 877125 ns | 880458 ns | 1.00 |
| private array/random/rand/Float32 | 703750 ns | 844333.5 ns | 0.83 |
| private array/copyto!/gpu_to_gpu | 622250 ns | 614333 ns | 1.01 |
| private array/copyto!/cpu_to_gpu | 692250 ns | 739479 ns | 0.94 |
| private array/copyto!/gpu_to_cpu | 594083.5 ns | 599208 ns | 0.99 |
| private array/accumulate/1d | 1434083 ns | 1447750.5 ns | 0.99 |
| private array/accumulate/2d | 1479500 ns | 1496375 ns | 0.99 |
| private array/iteration/findall/int | 2218500 ns | 2263917 ns | 0.98 |
| private array/iteration/findall/bool | 2002187.5 ns | 1989875 ns | 1.01 |
| private array/iteration/findfirst/int | 1688250 ns | 1678000 ns | 1.01 |
| private array/iteration/findfirst/bool | 1650625 ns | 1663625 ns | 0.99 |
| private array/iteration/scalar | 2399750 ns | 2393834 ns | 1.00 |
| private array/iteration/logical | 3446416 ns | 3431520.5 ns | 1.00 |
| private array/iteration/findmin/1d | 1757084 ns | 1794125 ns | 0.98 |
| private array/iteration/findmin/2d | 1358875 ns | 1403416 ns | 0.97 |
| private array/reductions/reduce/1d | 800917 ns | 805792 ns | 0.99 |
| private array/reductions/reduce/2d | 700479.5 ns | 704146 ns | 0.99 |
| private array/reductions/mapreduce/1d | 811125 ns | 815812.5 ns | 0.99 |
| private array/reductions/mapreduce/2d | 701166.5 ns | 716666.5 ns | 0.98 |
| private array/permutedims/4d | 947645.5 ns | 943959 ns | 1.00 |
| private array/permutedims/2d | 950791 ns | 938875 ns | 1.01 |
| private array/permutedims/3d | 1007916 ns | 1005416.5 ns | 1.00 |
| private array/copy | 876354.5 ns | 862875 ns | 1.02 |
| latency/precompile | 4414162875 ns | 4407793041 ns | 1.00 |
| latency/ttfp | 6916084749.5 ns | 6915521687.5 ns | 1.00 |
| latency/import | 726415791.5 ns | 726643917 ns | 1.00 |
| integration/metaldevrt | 743792 ns | 749270.5 ns | 0.99 |
| integration/byval/slices=1 | 1482750 ns | 1557959 ns | 0.95 |
| integration/byval/slices=3 | 8832249.5 ns | 8832020.5 ns | 1.00 |
| integration/byval/reference | 1515979 ns | 1611291 ns | 0.94 |
| integration/byval/slices=2 | 2747375 ns | 2583750 ns | 1.06 |
| kernel/indexing | 469583 ns | 476584 ns | 0.99 |
| kernel/indexing_checked | 444083 ns | 441500 ns | 1.01 |
| kernel/launch | 11125 ns | 10875 ns | 1.02 |
| metal/synchronization/stream | 19292 ns | 19208 ns | 1.00 |
| metal/synchronization/context | 19792 ns | 19750 ns | 1.00 |
| shared array/construct | 24017.416666666664 ns | 23756.916666666664 ns | 1.01 |
| shared array/broadcast | 466625 ns | 469584 ns | 0.99 |
| shared array/random/randn/Float32 | 1024625 ns | 1020166 ns | 1.00 |
| shared array/random/randn!/Float32 | 632917 ns | 634458 ns | 1.00 |
| shared array/random/rand!/Int64 | 579292 ns | 572000 ns | 1.01 |
| shared array/random/rand!/Float32 | 598750 ns | 593208.5 ns | 1.01 |
| shared array/random/rand/Int64 | 862833 ns | 742792 ns | 1.16 |
| shared array/random/rand/Float32 | 883625 ns | 898812.5 ns | 0.98 |
| shared array/copyto!/gpu_to_gpu | 97125 ns | 659667 ns | 0.15 |
| shared array/copyto!/cpu_to_gpu | 87542 ns | 94458 ns | 0.93 |
| shared array/copyto!/gpu_to_cpu | 82041 ns | 84333 ns | 0.97 |
| shared array/accumulate/1d | 1434500 ns | 1418250 ns | 1.01 |
| shared array/accumulate/2d | 1492917 ns | 1500167 ns | 1.00 |
| shared array/iteration/findall/int | 1972125 ns | 1939666 ns | 1.02 |
| shared array/iteration/findall/bool | 1780625 ns | 1746333 ns | 1.02 |
| shared array/iteration/findfirst/int | 1405208 ns | 1413458 ns | 0.99 |
| shared array/iteration/findfirst/bool | 1369834 ns | 1374750 ns | 1.00 |
| shared array/iteration/scalar | 187667 ns | 189167 ns | 0.99 |
| shared array/iteration/logical | 3193624.5 ns | 3212770.5 ns | 0.99 |
| shared array/iteration/findmin/1d | 1460500 ns | 1481709 ns | 0.99 |
| shared array/iteration/findmin/2d | 1374084 ns | 1379250 ns | 1.00 |
| shared array/reductions/reduce/1d | 673729 ns | 659583 ns | 1.02 |
| shared array/reductions/reduce/2d | 698209 ns | 706354 ns | 0.99 |
| shared array/reductions/mapreduce/1d | 631187 ns | 620667 ns | 1.02 |
| shared array/reductions/mapreduce/2d | 706416.5 ns | 704958.5 ns | 1.00 |
| shared array/permutedims/4d | 954291 ns | 963438 ns | 0.99 |
| shared array/permutedims/2d | 918604 ns | 939020.5 ns | 0.98 |
| shared array/permutedims/3d | 1013459 ns | 1003520.5 ns | 1.01 |
| shared array/copy | 239958.5 ns | 880541 ns | 0.27 |

This comment was automatically generated by workflow using github-action-benchmark.

@christiangnrd added the speculative (Not sure if we want this.) and performance (Gotta go fast.) labels on Oct 4, 2024
@christiangnrd christiangnrd marked this pull request as draft October 4, 2024 17:39
@maleadt (Member) commented on Oct 7, 2024

> Is this even a good idea?

I think so; we have similar optimizations in CUDA.jl with unified memory. Copies from and to CPU memory are blocking anyway.

@christiangnrd christiangnrd marked this pull request as ready for review October 7, 2024 17:16
@christiangnrd removed the speculative (Not sure if we want this.) label on Oct 7, 2024
@maleadt maleadt merged commit c4c0e28 into main Oct 8, 2024
2 checks passed
@maleadt maleadt deleted the fastercopy branch October 8, 2024 08:27
Labels
performance (Gotta go fast.)
2 participants