Vectorize (vec4) matmul and convolutions. #129

nsthorat · 2017-09-14T19:27:03Z

When taking dot products for matmul / conv, break them into sum of vec4 dot products. We see large wins.

Matmul (before):

Matmul (after):

Conv (before):

Conv (after):

Max pool (before):

Max pool (after):

This change is

dsmilkov · 2017-09-17T01:33:18Z

Reviewed 13 of 13 files at r1.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):

mport * as conv_util from '../../src/math/conv_util';
import {NDArrayMathCPU} from '../../src/math/math_cpu';
import {Array3D} from '../../src/math/ndarray';

import public api whenever you can, here and elsewhere in this PR

demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):

gpgpu_math.runProgram(binary, [x], res);

to improve accuracy, how about calling gpgpu_math.runProgram once more, outside the for loop, before you start measuring time, as a warmup. here and in all other benchmarks. this will trigger upload of x and res to the gpu. Otherwise given 40 op runs, the first run might have a significant impact on the average (it might not, but worth a try).

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):

d1=16

seems like depth is 512 in maxpool_gpu_benchmark. can you also double-check the other numbers for the other benchmarks (important not to give users bad perf results :))

demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):

1

outputDepth 512 to match the GPU? and speaking of this, how about centralizing these benchmark params to one place, otherwise we get discrepancies between
3 places: cpu/gpu/run groups metadata - which is what users see

src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):

} else {

looks like you can early return on the if before, remove this else, and un-indent the block

src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):

int wR = f / ${filterWidth};

This can probably be optimized further, but I'm not 100% sure. Is there a way to conclude that wR will have the same value for f,f+1, f+2, and f+3. if yes, you can move it outside getValue() and avoid computing it 4 times.

src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):

if (xR < 0 || xR >= ${xNumRows}

might improve perf: break this if in half, and move the part related to xRright after you compute xR.

src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):

          float avgValue = 0.0;

          for (int f = 0; f < ${filterNearestVec4}; f+=4) {

This might be too much to ask, but would be really good to understand the benefit of flattening out the weight indexing by benchmarking also the version where you keep two loops over wR and wC and convert to vec4 only the inner most loop.

If anything, this will help us understand how much room there is left to improve the flattened version, especially if the flat version is not faster than the un-flat version.

src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):

return !equals.w || !equals.x || !equals.y || !equals.z;

a faster version might be: return any(notEqual(values, values))

src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):

float getNaN(vec4 values)

gosh this doesn't look great. if there was a way to produce NaN in glsl :) I researched this at one point and didn't find a solution. I tried again now and seems like a cleaner way would be to pass a NaN as uniform
uniform float u_NaN for every program.

other people claim success with sqrt(-1.0); but hard to tell without testing on all browsers.

Comments from Reviewable

nsthorat · 2017-09-17T18:49:36Z

Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

d1=16

seems like depth is 512 in maxpool_gpu_benchmark. can you also double-check the other numbers for the other benchmarks (important not to give users bad perf results :))

Consolidated everything so param sharing is by design

demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

1

outputDepth 512 to match the GPU? and speaking of this, how about centralizing these benchmark params to one place, otherwise we get discrepancies between
3 places: cpu/gpu/run groups metadata - which is what users see

Done.

src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

} else {

looks like you can early return on the if before, remove this else, and un-indent the block

Done.

src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

int wR = f / ${filterWidth};

This can probably be optimized further, but I'm not 100% sure. Is there a way to conclude that wR will have the same value for f,f+1, f+2, and f+3. if yes, you can move it outside getValue() and avoid computing it 4 times.

Moved to just optimizing inner loop. That lets us do strided reads, so this is obsolute

src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

if (xR < 0 || xR >= ${xNumRows}

might improve perf: break this if in half, and move the part related to xRright after you compute xR.

I had done this originally, but it doesn't matter since all branches are taken. This just makes the code easier to read. Obsolete now.

src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

This might be too much to ask, but would be really good to understand the benefit of flattening out the weight indexing by benchmarking also the version where you keep two loops over wR and wC and convert to vec4 only the inner most loop.

If anything, this will help us understand how much room there is left to improve the flattened version, especially if the flat version is not faster than the un-flat version.

It's pretty much the same in terms of performance, but I ended up doing 2 loops because we can optimize the read later.

src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

return !equals.w || !equals.x || !equals.y || !equals.z;

a faster version might be: return any(notEqual(values, values))

Done.

src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

float getNaN(vec4 values)

gosh this doesn't look great. if there was a way to produce NaN in glsl :) I researched this at one point and didn't find a solution. I tried again now and seems like a cleaner way would be to pass a NaN as uniform
uniform float u_NaN for every program.

other people claim success with sqrt(-1.0); but hard to tell without testing on all browsers.

Unfortunately we can't do this, for several reasons. First of all NaN is implementation specific, and second if you always upload a NaN, we error out if the uniform isn't present (the chrome fragment shader compile transitively doesn't see a NaN in the program). This doesn't look great, but it's the best we can do (and vectorization is still much faster with this check).

demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

mport * as conv_util from '../../src/math/conv_util';
import {NDArrayMathCPU} from '../../src/math/math_cpu';
import {Array3D} from '../../src/math/ndarray';

import public api whenever you can, here and elsewhere in this PR

Done.

demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

gpgpu_math.runProgram(binary, [x], res);

to improve accuracy, how about calling gpgpu_math.runProgram once more, outside the for loop, before you start measuring time, as a warmup. here and in all other benchmarks. this will trigger upload of x and res to the gpu. Otherwise given 40 op runs, the first run might have a significant impact on the average (it might not, but worth a try).

You can't do that - runProgram is non-blocking, if we leave the first one out, we're actually incorrectly timing ourselves (imagine CPU finishes in 10ms, but the GPU command takes 1 second, we would see a slowdown). Charles has always said upload is negligible, which is why he didn't do this in the first place (and the warmup is split here since compiling a shader program is blocking). I'm going to add disjoint query timers in a follow up CL, so I don't think it's worth optimizing this method.

Comments from Reviewable

nsthorat · 2017-09-17T18:50:54Z

Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.

demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):