
Add threading feature for matrixmultiply when device is CPU #401

Closed
kstavro opened this issue Jan 25, 2023 · 6 comments · Fixed by #417

kstavro commented Jan 25, 2023

matrixmultiply already supports 1 to 4 threads, so its threading feature could probably be integrated into dfdx fairly easily.

I have already manually added matrixmultiply = { version = "0.3.2", features = ["threading"] } to the Cargo.toml (right after dfdx), but training still looks single-threaded. Needs further investigation.

Edit: I was wrong: just activating threading as above does improve performance. CPU utilization still looks rather single-threaded (that part I had right), but the extra utilization of the remaining threads is evidently enough to improve overall performance.

For comparison (on the 06-mnist example with BATCH_SIZE = 1024 on a Ryzen 5800x3d):

  • with $Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 2.718831s (21.333 batches/s): avg sample loss 0.00090
  • with $Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 1.2467984s (46.519 batches/s): avg sample loss 0.00090

Edit 2: Batch size plays a big role in the performance gains, which makes sense: the larger the batches, the more of the overall load falls on the matmuls. That is also why I initially thought threading had no effect.

With the original BATCH_SIZE = 32 of the example:

  • with $Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 5.2134919s (359.644 batches/s): avg sample loss 0.00770
  • with $Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 4.372751s (428.792 batches/s): avg sample loss 0.00770

coreylowman commented Jan 25, 2023

Hey yeah, this seems like a super easy win. I also just tried it with a large convolution benchmark and it really helped there too.

I think we should add this feature as a default.

It looks like it does depend on std, so I think we'd have to turn it off for no_std builds:

- std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std"]
+ std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std", "matrixmultiply/threading"]

Want to open a PR?

ViliamVadocz commented

How does this impact tiny networks? Could the overhead of threading decrease performance?

By tiny networks I mean something on the order of 3 feedforward layers with 32 weights each.
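For concreteness, something like this dfdx-style stack (a hypothetical sketch, reading "32 weights" as 32-unit hidden layers; Dev is whatever device alias the model is built on):

// hypothetical tiny MLP, dfdx type-alias style
type TinyMlp = (
    (Linear<32, 32, Dev>, ReLU),
    (Linear<32, 32, Dev>, ReLU),
    (Linear<32, 32, Dev>, ReLU),
);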

coreylowman commented

@ViliamVadocz good question, can you or @kstavro look into that?

kstavro commented Jan 26, 2023

@coreylowman @ViliamVadocz Can you point me to some meaningful low-dimensional dataset (or an example from this repo) to try out? I could make my own random dataset, but that would have to wait until the weekend. My level of Rust would require some googling even to create a random dataset like the above, whereas in Python it would have taken me 5 minutes or so.

coreylowman commented

Yeah you can just create random tensors and operate on that:

let x: Tensor<Rank2<64, 3>> = dev.sample_normal();
let y = model.forward(x);

There will be some time spent on sampling from the distribution, but that time should be the same whether you're using matrixmultiply threading or not, so it's still safe to compare the two situations.
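For example, here's a minimal (untested) sketch that just times matmuls on random tensors; the shapes are arbitrary, and the exact Tensor generics may differ depending on the dfdx version you're on:

use std::time::Instant;
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();

    let start = Instant::now();
    for _ in 0..10_000 {
        // random input and weight matrices, re-sampled every iteration
        let x: Tensor<Rank2<64, 32>, f32, _> = dev.sample_normal();
        let w: Tensor<Rank2<32, 32>, f32, _> = dev.sample_normal();
        // this matmul is the op that matrixmultiply's threading feature affects
        let _y = x.matmul(w);
    }
    println!("10_000 matmuls in {:?}", start.elapsed());
}

Run it once with MATMUL_NUM_THREADS=1 and once with MATMUL_NUM_THREADS=4 and compare.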

kstavro commented Jan 27, 2023

Ok, so I first tried to generate the batch tensors before training, so that only the training time gets measured, but since tensors don't implement the Copy trait and cross_entropy_with_logits_loss doesn't accept references (to the tensors in the collection I had packed them into), in the end I did what @coreylowman suggested and sampled each batch inside the training loop.

This is the setup I tried, with both forward and backward passes:

const INPUT_DIM: usize = 32;
const BATCH_SIZE: usize = 32;
const NUM_BATCHES: usize = 1000;

// our network structure
type Mlp = (
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    Linear<INPUT_DIM, 10, Dev>,
);
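
The timing loop itself looked roughly like this (an untested sketch from memory; Dev is the Cpu device alias, and the module-building and tape-tracking calls, build_module and trace here, are placeholders for whatever the dfdx version in use provides):

use std::time::Instant;
use dfdx::prelude::*;

type Dev = Cpu;

fn main() {
    let dev: Dev = Default::default();
    // build the Mlp defined above on the CPU device
    let model = dev.build_module::<Mlp, f32>();

    for epoch in 0..20 {
        let start = Instant::now();
        for _ in 0..NUM_BATCHES {
            // sample a fresh random batch inside the loop (tensors aren't Copy)
            let x: Tensor<Rank2<BATCH_SIZE, INPUT_DIM>, f32, Dev> = dev.sample_normal();
            // random "targets"; only the timing matters here
            let y: Tensor<Rank2<BATCH_SIZE, 10>, f32, Dev> = dev.sample_normal();

            // forward with gradient tracking, then backward
            let logits = model.forward(x.trace());
            let loss = cross_entropy_with_logits_loss(logits, y);
            let _grads = loss.backward();
        }
        let dur = start.elapsed();
        println!(
            "Epoch {epoch} in {:?} ({:.3} batches/s)",
            dur,
            NUM_BATCHES as f64 / dur.as_secs_f64()
        );
    }
}

With that loop: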
  • with MATMUL_NUM_THREADS=1:
Epoch 0 in 93.6847ms (10674.102 batches/s)
Epoch 1 in 93.2982ms (10718.320 batches/s)
Epoch 2 in 93.1736ms (10732.653 batches/s)
Epoch 3 in 93.1092ms (10740.077 batches/s)
Epoch 4 in 93.6885ms (10673.669 batches/s)
Epoch 5 in 93.5083ms (10694.237 batches/s)
Epoch 6 in 94.1358ms (10622.951 batches/s)
Epoch 7 in 92.9713ms (10756.008 batches/s)
Epoch 8 in 94.1525ms (10621.067 batches/s)
Epoch 9 in 93.8917ms (10650.569 batches/s)
Epoch 10 in 93.7845ms (10662.743 batches/s)
Epoch 11 in 93.964ms (10642.373 batches/s)
Epoch 12 in 93.2793ms (10720.493 batches/s)
Epoch 13 in 93.5685ms (10687.357 batches/s)
Epoch 14 in 93.8829ms (10651.567 batches/s)
Epoch 15 in 92.8858ms (10765.908 batches/s)
Epoch 16 in 94.3197ms (10602.239 batches/s)
Epoch 17 in 94.9412ms (10532.835 batches/s)
Epoch 18 in 94.1452ms (10621.891 batches/s)
Epoch 19 in 93.5499ms (10689.481 batches/s)
  • with MATMUL_NUM_THREADS=4:
Epoch 0 in 92.8763ms (10767.009 batches/s)
Epoch 1 in 93.8463ms (10655.721 batches/s)
Epoch 2 in 93.0505ms (10746.854 batches/s)
Epoch 3 in 93.2228ms (10726.990 batches/s)
Epoch 4 in 93.7438ms (10667.372 batches/s)
Epoch 5 in 93.0092ms (10751.624 batches/s)
Epoch 6 in 93.5389ms (10690.740 batches/s)
Epoch 7 in 93.9582ms (10643.030 batches/s)
Epoch 8 in 93.6632ms (10676.552 batches/s)
Epoch 9 in 95.1873ms (10505.604 batches/s)
Epoch 10 in 93.4395ms (10702.111 batches/s)
Epoch 12 in 93.2298ms (10726.184 batches/s)
Epoch 13 in 92.834ms (10771.915 batches/s)
Epoch 14 in 93.0201ms (10750.365 batches/s)
Epoch 15 in 93.5465ms (10689.871 batches/s)
Epoch 16 in 94.2627ms (10608.649 batches/s)
Epoch 17 in 93.4911ms (10696.204 batches/s)
Epoch 18 in 93.7193ms (10670.161 batches/s)
Epoch 19 in 93.7442ms (10667.326 batches/s)

Performance is pretty much identical. Since I couldn't pre-sample the tensors, I also measured how much time the batch sampling itself takes per epoch (around 10%):

  • with MATMUL_NUM_THREADS=1:
Batch sampling 0 in 9.674ms (103369.859 samplings/s)
Batch sampling 1 in 10.1066ms (98945.242 samplings/s)
Batch sampling 2 in 9.9193ms (100813.570 samplings/s)
Batch sampling 3 in 9.9976ms (100024.008 samplings/s)
Batch sampling 4 in 9.9421ms (100582.367 samplings/s)
Batch sampling 5 in 9.9558ms (100443.961 samplings/s)
Batch sampling 6 in 9.943ms (100573.266 samplings/s)
Batch sampling 7 in 10.2383ms (97672.461 samplings/s)
Batch sampling 8 in 10.0382ms (99619.453 samplings/s)
Batch sampling 9 in 10.0456ms (99546.070 samplings/s)
Batch sampling 10 in 9.8453ms (101571.312 samplings/s)
Batch sampling 11 in 9.7879ms (102166.961 samplings/s)
Batch sampling 12 in 10.0004ms (99996.000 samplings/s)
Batch sampling 13 in 10.1479ms (98542.555 samplings/s)
Batch sampling 14 in 9.8581ms (101439.430 samplings/s)
Batch sampling 15 in 9.9786ms (100214.461 samplings/s)
Batch sampling 16 in 10.0615ms (99388.766 samplings/s)
Batch sampling 17 in 10.0925ms (99083.484 samplings/s)
Batch sampling 18 in 9.9659ms (100342.164 samplings/s)
Batch sampling 19 in 9.821ms (101822.633 samplings/s)
  • with MATMUL_NUM_THREADS=4:
Batch sampling 0 in 10.0156ms (99844.242 samplings/s)
Batch sampling 1 in 9.837ms (101657.016 samplings/s)
Batch sampling 2 in 9.3604ms (106833.039 samplings/s)
Batch sampling 3 in 9.9398ms (100605.648 samplings/s)
Batch sampling 4 in 10.1919ms (98117.133 samplings/s)
Batch sampling 5 in 9.6616ms (103502.523 samplings/s)
Batch sampling 6 in 9.6002ms (104164.500 samplings/s)
Batch sampling 7 in 9.7433ms (102634.625 samplings/s)
Batch sampling 8 in 9.5042ms (105216.641 samplings/s)
Batch sampling 9 in 9.9855ms (100145.211 samplings/s)
Batch sampling 10 in 10.803ms (92566.883 samplings/s)
Batch sampling 12 in 9.9396ms (100607.672 samplings/s)
Batch sampling 13 in 9.9663ms (100338.141 samplings/s)
Batch sampling 14 in 9.8273ms (101757.352 samplings/s)
Batch sampling 15 in 9.6275ms (103869.133 samplings/s)
Batch sampling 16 in 9.7661ms (102395.023 samplings/s)
Batch sampling 17 in 9.7453ms (102613.570 samplings/s)
Batch sampling 18 in 9.9289ms (100716.094 samplings/s)
Batch sampling 19 in 9.6684ms (103429.734 samplings/s)

To me, they look practically identical. @ViliamVadocz I could also try some other combinations of input dimension, batch size and number of batches. Otherwise, I think we can open the PR.
