
Add threading feature for matrixmultiply when device is CPU #401

Closed
kstavro opened this issue Jan 25, 2023 · 6 comments · Fixed by #417

kstavro commented Jan 25, 2023

matrixmultiply already supports 1 to 4 threads, so its threading feature could probably be integrated into dfdx fairly easily.

I have already manually added matrixmultiply = { version = "0.3.2", features = ["threading"] } to the Cargo.toml (right after dfdx), but training still looks single-threaded. Needs further investigation.

Edit: I was wrong: just activating threading as above does improve performance. CPU utilization still looks rather single-threaded (that part I had right), but the extra utilization of the remaining threads is evidently enough to improve overall performance.

For comparison (on the 06-mnist example with BATCH_SIZE = 1024 on a Ryzen 5800x3d):

  • with $Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 2.718831s (21.333 batches/s): avg sample loss 0.00090
  • with $Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 1.2467984s (46.519 batches/s): avg sample loss 0.00090

Edit 2: Batch size plays a big role in the performance gains, which makes sense: the larger the batches, the more of the overall load falls on the matmuls. That is also why I initially thought threading had no effect.

With the original BATCH_SIZE = 32 of the example:

  • with $Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 5.2134919s (359.644 batches/s): avg sample loss 0.00770
  • with $Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\:
Found 60000 training images
Epoch 0 in 4.372751s (428.792 batches/s): avg sample loss 0.00770

coreylowman commented Jan 25, 2023

Hey yeah, this seems like a super easy win. I also just tried it with a large convolution benchmark and it really helped there too.

I think we should add this feature as a default.

It looks like it does depend on std, so I think we'd have to turn it off for no_std builds:

- std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std"]
+ std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std", "matrixmultiply/threading"]

Want to open a PR?

ViliamVadocz commented

How does this impact tiny networks? Could the overhead of threading decrease performance?

By tiny networks I mean something on the order of 3 feedforward layers with 32 weights each.
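For concreteness, something like this dfdx-style stack (a hypothetical sketch, reading "32 weights" as 32-unit hidden layers; Dev is whatever device alias the model is built on):

// hypothetical tiny MLP, dfdx type-alias style
type TinyMlp = (
    (Linear<32, 32, Dev>, ReLU),
    (Linear<32, 32, Dev>, ReLU),
    (Linear<32, 32, Dev>, ReLU),
);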

coreylowman commented

@ViliamVadocz good question, can you or @kstavro look into that?

kstavro commented Jan 26, 2023

@coreylowman @ViliamVadocz Can you point me to some meaningful low-dimensional dataset (or an example from this repo) to try out? I could make my own random dataset, but that would have to wait until the weekend. My level of Rust would require some googling even to create a random dataset like the above, whereas in Python it would have taken me 5 minutes or so.

coreylowman commented

Yeah you can just create random tensors and operate on that:

let x: Tensor<Rank2<64, 3>> = dev.sample_normal();
let y = model.forward(x);

There will be some time spent on sampling from the distribution, but that time should be the same whether you're using matrixmultiply threading or not, so it's still safe to compare the two situations.
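For example, here's a minimal (untested) sketch that just times matmuls on random tensors; the shapes are arbitrary, and the exact Tensor generics may differ depending on the dfdx version you're on:

use std::time::Instant;
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();

    let start = Instant::now();
    for _ in 0..10_000 {
        // random input and weight matrices, re-sampled every iteration
        let x: Tensor<Rank2<64, 32>, f32, _> = dev.sample_normal();
        let w: Tensor<Rank2<32, 32>, f32, _> = dev.sample_normal();
        // this matmul is the op that matrixmultiply's threading feature affects
        let _y = x.matmul(w);
    }
    println!("10_000 matmuls in {:?}", start.elapsed());
}

Run it once with MATMUL_NUM_THREADS=1 and once with MATMUL_NUM_THREADS=4 and compare.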

kstavro commented Jan 27, 2023

Ok, so I first tried to generate the batch tensors before training, so that only the training time gets measured, but since tensors don't implement the Copy trait and cross_entropy_with_logits_loss doesn't accept references (to the tensors in the collection I had packed them into), in the end I did what @coreylowman suggested and sampled each batch inside the training loop.

This is the setup I tried, with both forward and backward passes:

const INPUT_DIM: usize = 32;
const BATCH_SIZE: usize = 32;
const NUM_BATCHES: usize = 1000;

// our network structure
type Mlp = (
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    (Linear<INPUT_DIM, INPUT_DIM, Dev>, ReLU),
    Linear<INPUT_DIM, 10, Dev>,
);
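
The timing loop itself looked roughly like this (an untested sketch from memory; Dev is the Cpu device alias, and the module-building and tape-tracking calls, build_module and trace here, are placeholders for whatever the dfdx version in use provides):

use std::time::Instant;
use dfdx::prelude::*;

type Dev = Cpu;

fn main() {
    let dev: Dev = Default::default();
    // build the Mlp defined above on the CPU device
    let model = dev.build_module::<Mlp, f32>();

    for epoch in 0..20 {
        let start = Instant::now();
        for _ in 0..NUM_BATCHES {
            // sample a fresh random batch inside the loop (tensors aren't Copy)
            let x: Tensor<Rank2<BATCH_SIZE, INPUT_DIM>, f32, Dev> = dev.sample_normal();
            // random "targets"; only the timing matters here
            let y: Tensor<Rank2<BATCH_SIZE, 10>, f32, Dev> = dev.sample_normal();

            // forward with gradient tracking, then backward
            let logits = model.forward(x.trace());
            let loss = cross_entropy_with_logits_loss(logits, y);
            let _grads = loss.backward();
        }
        let dur = start.elapsed();
        println!(
            "Epoch {epoch} in {:?} ({:.3} batches/s)",
            dur,
            NUM_BATCHES as f64 / dur.as_secs_f64()
        );
    }
}

With that loop: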
  • with MATMUL_NUM_THREADS=1:
Epoch 0 in 93.6847ms (10674.102 batches/s)
Epoch 1 in 93.2982ms (10718.320 batches/s)
Epoch 2 in 93.1736ms (10732.653 batches/s)
Epoch 3 in 93.1092ms (10740.077 batches/s)
Epoch 4 in 93.6885ms (10673.669 batches/s)
Epoch 5 in 93.5083ms (10694.237 batches/s)
Epoch 6 in 94.1358ms (10622.951 batches/s)
Epoch 7 in 92.9713ms (10756.008 batches/s)
Epoch 8 in 94.1525ms (10621.067 batches/s)
Epoch 9 in 93.8917ms (10650.569 batches/s)
Epoch 10 in 93.7845ms (10662.743 batches/s)
Epoch 11 in 93.964ms (10642.373 batches/s)
Epoch 12 in 93.2793ms (10720.493 batches/s)
Epoch 13 in 93.5685ms (10687.357 batches/s)
Epoch 14 in 93.8829ms (10651.567 batches/s)
Epoch 15 in 92.8858ms (10765.908 batches/s)
Epoch 16 in 94.3197ms (10602.239 batches/s)
Epoch 17 in 94.9412ms (10532.835 batches/s)
Epoch 18 in 94.1452ms (10621.891 batches/s)
Epoch 19 in 93.5499ms (10689.481 batches/s)
  • with MATMUL_NUM_THREADS=4:
Epoch 0 in 92.8763ms (10767.009 batches/s)
Epoch 1 in 93.8463ms (10655.721 batches/s)
Epoch 2 in 93.0505ms (10746.854 batches/s)
Epoch 3 in 93.2228ms (10726.990 batches/s)
Epoch 4 in 93.7438ms (10667.372 batches/s)
Epoch 5 in 93.0092ms (10751.624 batches/s)
Epoch 6 in 93.5389ms (10690.740 batches/s)
Epoch 7 in 93.9582ms (10643.030 batches/s)
Epoch 8 in 93.6632ms (10676.552 batches/s)
Epoch 9 in 95.1873ms (10505.604 batches/s)
Epoch 10 in 93.4395ms (10702.111 batches/s)
Epoch 12 in 93.2298ms (10726.184 batches/s)
Epoch 13 in 92.834ms (10771.915 batches/s)
Epoch 14 in 93.0201ms (10750.365 batches/s)
Epoch 15 in 93.5465ms (10689.871 batches/s)
Epoch 16 in 94.2627ms (10608.649 batches/s)
Epoch 17 in 93.4911ms (10696.204 batches/s)
Epoch 18 in 93.7193ms (10670.161 batches/s)
Epoch 19 in 93.7442ms (10667.326 batches/s)

Performance is pretty much identical. Since I couldn't pre-sample the tensors, I also measured how much time the batch sampling itself takes per epoch (around 10%):

  • with MATMUL_NUM_THREADS=1:
Batch sampling 0 in 9.674ms (103369.859 samplings/s)
Batch sampling 1 in 10.1066ms (98945.242 samplings/s)
Batch sampling 2 in 9.9193ms (100813.570 samplings/s)
Batch sampling 3 in 9.9976ms (100024.008 samplings/s)
Batch sampling 4 in 9.9421ms (100582.367 samplings/s)
Batch sampling 5 in 9.9558ms (100443.961 samplings/s)
Batch sampling 6 in 9.943ms (100573.266 samplings/s)
Batch sampling 7 in 10.2383ms (97672.461 samplings/s)
Batch sampling 8 in 10.0382ms (99619.453 samplings/s)
Batch sampling 9 in 10.0456ms (99546.070 samplings/s)
Batch sampling 10 in 9.8453ms (101571.312 samplings/s)
Batch sampling 11 in 9.7879ms (102166.961 samplings/s)
Batch sampling 12 in 10.0004ms (99996.000 samplings/s)
Batch sampling 13 in 10.1479ms (98542.555 samplings/s)
Batch sampling 14 in 9.8581ms (101439.430 samplings/s)
Batch sampling 15 in 9.9786ms (100214.461 samplings/s)
Batch sampling 16 in 10.0615ms (99388.766 samplings/s)
Batch sampling 17 in 10.0925ms (99083.484 samplings/s)
Batch sampling 18 in 9.9659ms (100342.164 samplings/s)
Batch sampling 19 in 9.821ms (101822.633 samplings/s)
  • with MATMUL_NUM_THREADS=4:
Batch sampling 0 in 10.0156ms (99844.242 samplings/s)
Batch sampling 1 in 9.837ms (101657.016 samplings/s)
Batch sampling 2 in 9.3604ms (106833.039 samplings/s)
Batch sampling 3 in 9.9398ms (100605.648 samplings/s)
Batch sampling 4 in 10.1919ms (98117.133 samplings/s)
Batch sampling 5 in 9.6616ms (103502.523 samplings/s)
Batch sampling 6 in 9.6002ms (104164.500 samplings/s)
Batch sampling 7 in 9.7433ms (102634.625 samplings/s)
Batch sampling 8 in 9.5042ms (105216.641 samplings/s)
Batch sampling 9 in 9.9855ms (100145.211 samplings/s)
Batch sampling 10 in 10.803ms (92566.883 samplings/s)
Batch sampling 12 in 9.9396ms (100607.672 samplings/s)
Batch sampling 13 in 9.9663ms (100338.141 samplings/s)
Batch sampling 14 in 9.8273ms (101757.352 samplings/s)
Batch sampling 15 in 9.6275ms (103869.133 samplings/s)
Batch sampling 16 in 9.7661ms (102395.023 samplings/s)
Batch sampling 17 in 9.7453ms (102613.570 samplings/s)
Batch sampling 18 in 9.9289ms (100716.094 samplings/s)
Batch sampling 19 in 9.6684ms (103429.734 samplings/s)

To me, they look practically identical. @ViliamVadocz I could also try some other combinations of input dimension, batch size and number of batches. Otherwise, I think we can open the PR.
