Vectorize (vec4) matmul and convolutions. #129
Reviewed 13 of 13 files at r1.

demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):
import public api whenever you can, here and elsewhere in this PR

demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):
to improve accuracy, how about calling gpgpu_math.runProgram once more, outside the for loop, before you start measuring time, as a warmup. here and in all other benchmarks. this will trigger upload of x and res to the gpu. Otherwise given 40 op runs, the first run might have a significant impact on the average (it might not, but worth a try).

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):
seems like depth is 512 in maxpool_gpu_benchmark. can you also double-check the other numbers for the other benchmarks (important not to give users bad perf results :))

demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):
outputDepth 512 to match the GPU? and speaking of this, how about centralizing these benchmark params to one place, otherwise we get discrepancies between

src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):
looks like you can early return on the if before, remove this else, and un-indent the block

src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):
This can probably be optimized further, but I'm not 100% sure. Is there a way to conclude that

src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):
might improve perf: break this

src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):
This might be too much to ask, but would be really good to understand the benefit of flattening out the weight indexing by benchmarking also the version where you keep two loops over
If anything, this will help us understand how much room there is left to improve the flattened version, especially if the flat version is not faster than the un-flat version.

src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):
a faster version might be:

src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):
gosh this doesn't look great. if there was a way to produce NaN in glsl :) I researched this at one point and didn't find a solution. I tried again now and seems like a cleaner way would be to pass a NaN as uniform
other people claim success with
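
To make the warmup concern on avg_pool_gpu_benchmark.ts concrete, here is a small sketch of how a single slow first run shifts a 40-run average. The timings (10 ms steady state, 50 ms first run) are hypothetical and chosen only for illustration; they are not measurements from this PR, and `mean` is a helper defined here, not part of the benchmark code.

```typescript
// Arithmetic mean of a list of per-run timings, in milliseconds.
function mean(timesMs: number[]): number {
  return timesMs.reduce((a, b) => a + b, 0) / timesMs.length;
}

// Hypothetical numbers, for illustration only.
const steadyMs = 10;   // assumed cost of a run once inputs are on the GPU
const firstRunMs = 50; // assumed cost of a first run that also pays for upload

// 40 timed runs where the first one includes the one-time cost.
const withFirstRun: number[] = [firstRunMs, ...Array(39).fill(steadyMs)];
// 40 timed runs taken after a separate, untimed warmup run.
const afterWarmup: number[] = Array(40).fill(steadyMs);

const avgWithFirstRun = mean(withFirstRun); // (50 + 39 * 10) / 40 = 11
const avgAfterWarmup = mean(afterWarmup);   // 10
```

Under these assumed numbers the one-time cost inflates the reported average by 10%. Whether the real upload cost is large enough to matter is exactly the open question in this thread.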
Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Consolidated everything so param sharing is by design

demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Moved to just optimizing the inner loop. That lets us do strided reads, so this is obsolete.

src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
I had done this originally, but it doesn't matter since all branches are taken. This just makes the code easier to read. Obsolete now.

src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
It's pretty much the same in terms of performance, but I ended up doing 2 loops because we can optimize the read later.

src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Done.

src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Unfortunately we can't do this, for several reasons. First of all, NaN is implementation specific, and second, if you always upload a NaN, we error out if the uniform isn't present (the chrome fragment shader compile transitively doesn't see a NaN in the program). This doesn't look great, but it's the best we can do (and vectorization is still much faster with this check).

demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Done.

demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
You can't do that: runProgram is non-blocking, so if we leave the first one out, we're actually incorrectly timing ourselves (imagine the CPU finishes in 10ms, but the GPU command takes 1 second, we would see a slowdown). Charles has always said upload is negligible, which is why he didn't do this in the first place (and the warmup is split here since compiling a shader program is blocking). I'm going to add disjoint query timers in a follow up CL, so I don't think it's worth optimizing this method.
Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.

demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Done.
Review status: 5 of 22 files reviewed at latest revision, all discussions resolved, some commit checks failed.
* vec4 matmul and conv
* for conv benchmarks, make input depth 10
* update depth in html to reflect new 10 input depth
* checkpointing, pooling ops faster now
* add avg pooling benchmark
* change conv depth to 16 in benchmark
* Add unit tests
* fix lint
* respond to comments
* lint
When taking dot products for matmul / conv, break them into a sum of vec4 dot products. We see large wins.
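
The decomposition described above can be sketched on the CPU in TypeScript. This is only an illustration of the idea, not the shader code from this PR: `Vec4`, `dot4`, and `dotByVec4` are hypothetical names, and handling the leftover elements with a scalar loop is an assumption made here for simplicity.

```typescript
// A group of four floats, standing in for a GLSL vec4.
type Vec4 = [number, number, number, number];

// Dot product of two 4-wide groups (what GLSL's built-in dot(vec4, vec4) computes).
function dot4(a: Vec4, b: Vec4): number {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Dot product of two equal-length vectors, accumulated four elements at a time.
function dotByVec4(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dotByVec4: length mismatch");
  }
  const n = a.length;
  // Largest multiple of 4 that fits in n.
  const vec4End = Math.floor(n / 4) * 4;
  let sum = 0;
  // Main loop: one vec4 dot product per iteration.
  for (let i = 0; i < vec4End; i += 4) {
    sum += dot4(
      [a[i], a[i + 1], a[i + 2], a[i + 3]],
      [b[i], b[i + 1], b[i + 2], b[i + 3]],
    );
  }
  // Remainder: up to three leftover elements, handled one at a time.
  for (let i = vec4End; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

For example, `dotByVec4([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])` takes one vec4 step over the first four elements and two scalar steps for the rest. On a GPU the benefit comes from the hardware evaluating each 4-wide dot product as a single operation; this CPU version shows the loop structure only and implies nothing about speed.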
Matmul (before):
Matmul (after):
Conv (before):
Conv (after):
Max pool (before):
Max pool (after):