Skip to content
This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Vectorize (vec4) matmul and convolutions. #129

Merged
merged 12 commits into from
Sep 17, 2017
Merged

Vectorize (vec4) matmul and convolutions. #129

merged 12 commits into from
Sep 17, 2017

Conversation

nsthorat
Copy link
Contributor

@nsthorat nsthorat commented Sep 14, 2017

When taking dot products for matmul / conv, break them into sum of vec4 dot products. We see large wins.

Matmul (before):
image
Matmul (after):
image

Conv (before):
image
Conv (after):
image

Max pool (before):
image

Max pool (after):
image


This change is Reviewable

@dsmilkov
Copy link
Contributor

Reviewed 13 of 13 files at r1.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):

mport * as conv_util from '../../src/math/conv_util';
import {NDArrayMathCPU} from '../../src/math/math_cpu';
import {Array3D} from '../../src/math/ndarray';

import public api whenever you can, here and elsewhere in this PR


demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):

gpgpu_math.runProgram(binary, [x], res);

to improve accuracy, how about calling gpgpu_math.runProgram once more, outside the for loop, before you start measuring time, as a warmup. here and in all other benchmarks. this will trigger upload of x and res to the gpu. Otherwise given 40 op runs, the first run might have a significant impact on the average (it might not, but worth a try).


demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):

d1=16

seems like depth is 512 in maxpool_gpu_benchmark. can you also double-check the other numbers for the other benchmarks (important not to give users bad perf results :))


demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):

1

outputDepth 512 to match the GPU? and speaking of this, how about centralizing these benchmark params to one place, otherwise we get discrepancies between
3 places: cpu/gpu/run groups metadata - which is what users see


src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):

} else {

looks like you can early return on the if before, remove this else, and un-indent the block


src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):

int wR = f / ${filterWidth};

This can probably be optimized further, but I'm not 100% sure. Is there a way to conclude that wR will have the same value for f,f+1, f+2, and f+3. if yes, you can move it outside getValue() and avoid computing it 4 times.


src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):

if (xR < 0 || xR >= ${xNumRows}

might improve perf: break this if in half, and move the part related to xRright after you compute xR.


src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):

          float avgValue = 0.0;

          for (int f = 0; f < ${filterNearestVec4}; f+=4) {

This might be too much to ask, but would be really good to understand the benefit of flattening out the weight indexing by benchmarking also the version where you keep two loops over wR and wC and convert to vec4 only the inner most loop.

If anything, this will help us understand how much room there is left to improve the flattened version, especially if the flat version is not faster than the un-flat version.


src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):

return !equals.w || !equals.x || !equals.y || !equals.z;

a faster version might be: return any(notEqual(values, values))


src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):

float getNaN(vec4 values)

gosh this doesn't look great. if there was a way to produce NaN in glsl :) I researched this at one point and didn't find a solution. I tried again now and seems like a cleaner way would be to pass a NaN as uniform
uniform float u_NaN for every program.

other people claim success with sqrt(-1.0); but hard to tell without testing on all browsers.


Comments from Reviewable

@nsthorat
Copy link
Contributor Author

Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

d1=16

seems like depth is 512 in maxpool_gpu_benchmark. can you also double-check the other numbers for the other benchmarks (important not to give users bad perf results :))

Consolidated everything so param sharing is by design


demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

1

outputDepth 512 to match the GPU? and speaking of this, how about centralizing these benchmark params to one place, otherwise we get discrepancies between
3 places: cpu/gpu/run groups metadata - which is what users see

Done.


src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

} else {

looks like you can early return on the if before, remove this else, and un-indent the block

Done.


src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

int wR = f / ${filterWidth};

This can probably be optimized further, but I'm not 100% sure. Is there a way to conclude that wR will have the same value for f,f+1, f+2, and f+3. if yes, you can move it outside getValue() and avoid computing it 4 times.

Moved to just optimizing inner loop. That lets us do strided reads, so this is obsolute


src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

if (xR < 0 || xR >= ${xNumRows}

might improve perf: break this if in half, and move the part related to xRright after you compute xR.

I had done this originally, but it doesn't matter since all branches are taken. This just makes the code easier to read. Obsolete now.


src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

This might be too much to ask, but would be really good to understand the benefit of flattening out the weight indexing by benchmarking also the version where you keep two loops over wR and wC and convert to vec4 only the inner most loop.

If anything, this will help us understand how much room there is left to improve the flattened version, especially if the flat version is not faster than the un-flat version.

It's pretty much the same in terms of performance, but I ended up doing 2 loops because we can optimize the read later.


src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

return !equals.w || !equals.x || !equals.y || !equals.z;

a faster version might be: return any(notEqual(values, values))

Done.


src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

float getNaN(vec4 values)

gosh this doesn't look great. if there was a way to produce NaN in glsl :) I researched this at one point and didn't find a solution. I tried again now and seems like a cleaner way would be to pass a NaN as uniform
uniform float u_NaN for every program.

other people claim success with sqrt(-1.0); but hard to tell without testing on all browsers.

Unfortunately we can't do this, for several reasons. First of all NaN is implementation specific, and second if you always upload a NaN, we error out if the uniform isn't present (the chrome fragment shader compile transitively doesn't see a NaN in the program). This doesn't look great, but it's the best we can do (and vectorization is still much faster with this check).


demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

mport * as conv_util from '../../src/math/conv_util';
import {NDArrayMathCPU} from '../../src/math/math_cpu';
import {Array3D} from '../../src/math/ndarray';

import public api whenever you can, here and elsewhere in this PR

Done.


demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

gpgpu_math.runProgram(binary, [x], res);

to improve accuracy, how about calling gpgpu_math.runProgram once more, outside the for loop, before you start measuring time, as a warmup. here and in all other benchmarks. this will trigger upload of x and res to the gpu. Otherwise given 40 op runs, the first run might have a significant impact on the average (it might not, but worth a try).

You can't do that - runProgram is non-blocking, if we leave the first one out, we're actually incorrectly timing ourselves (imagine CPU finishes in 10ms, but the GPU command takes 1 second, we would see a slowdown). Charles has always said upload is negligible, which is why he didn't do this in the first place (and the warmup is split here since compiling a shader program is blocking). I'm going to add disjoint query timers in a follow up CL, so I don't think it's worth optimizing this method.


Comments from Reviewable

@nsthorat
Copy link
Contributor Author

Review status: 5 of 22 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


demos/benchmarks/math-benchmark-run-groups.ts, line 62 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Consolidated everything so param sharing is by design

Done.


demos/benchmarks/max_pool_cpu_benchmark.ts, line 36 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Done.

Done.


src/math/webgl/pool_gpu.ts, line 117 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Done.

Done.


src/math/webgl/pool_gpu.ts, line 145 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Moved to just optimizing inner loop. That lets us do strided reads, so this is obsolute

Done.


src/math/webgl/pool_gpu.ts, line 151 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I had done this originally, but it doesn't matter since all branches are taken. This just makes the code easier to read. Obsolete now.

Done.


src/math/webgl/pool_gpu.ts, line 172 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

It's pretty much the same in terms of performance, but I ended up doing 2 loops because we can optimize the read later.

Done.


src/math/webgl/shader_compiler.ts, line 170 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Done.

Done.


src/math/webgl/shader_compiler.ts, line 173 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Unfortunately we can't do this, for several reasons. First of all NaN is implementation specific, and second if you always upload a NaN, we error out if the uniform isn't present (the chrome fragment shader compile transitively doesn't see a NaN in the program). This doesn't look great, but it's the best we can do (and vectorization is still much faster with this check).

Done.


demos/benchmarks/avg_pool_cpu_benchmark.ts, line 20 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Done.

Done.


demos/benchmarks/avg_pool_gpu_benchmark.ts, line 47 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

You can't do that - runProgram is non-blocking, if we leave the first one out, we're actually incorrectly timing ourselves (imagine CPU finishes in 10ms, but the GPU command takes 1 second, we would see a slowdown). Charles has always said upload is negligible, which is why he didn't do this in the first place (and the warmup is split here since compiling a shader program is blocking). I'm going to add disjoint query timers in a follow up CL, so I don't think it's worth optimizing this method.

Done.


Comments from Reviewable

@dsmilkov
Copy link
Contributor

:lgtm_strong:


Review status: 5 of 22 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

Nikhil Thorat added 2 commits September 17, 2017 17:43
@nsthorat nsthorat merged commit 7f87d33 into master Sep 17, 2017
@nsthorat nsthorat deleted the vec4 branch September 17, 2017 21:52
mnottheone pushed a commit to mnottheone/deeplearnjs that referenced this pull request Dec 1, 2018
* vec4 matmul and conv

* for conv benchmarks, make input depth 10

* update depth in html to reflect new 10 input depth

* checkpointing, pooling ops faster now

* add avg pooling benchmark

* change conv depth to 16 in benchmark

* Add unit tests

* fix lint

* respond to comments

* lint
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants