Add threading feature for matrixmultiply when device is CPU #401
Hey, yeah, this seems like a super easy win. I also just tried it with a large convolution benchmark and it really helped there too. I think we should add this feature as a default. It looks like it does depend on std, so I think we'd have to turn it off in no-std:

```diff
- std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std"]
+ std = ["no-std-compat/std", "rand/std", "rand_distr/std", "cudarc?/std", "matrixmultiply/threading"]
```

Want to open a PR?
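For context, a sketch of how the relevant pieces of dfdx's Cargo.toml could look with that change. This is only an illustration: the matrixmultiply version comes from the report below, while the other dependency lines and version numbers are placeholders, not the crate's actual manifest.

```toml
[dependencies]
matrixmultiply = "0.3.2"
no-std-compat = { version = "0.4", default-features = false }
rand = { version = "0.8", default-features = false }
rand_distr = { version = "0.4", default-features = false }
cudarc = { version = "0.9", optional = true }

[features]
default = ["std"]
# "matrixmultiply/threading" sits behind the std feature because the thread
# pool needs the standard library, so no-std builds stay single-threaded.
std = [
    "no-std-compat/std",
    "rand/std",
    "rand_distr/std",
    "cudarc?/std",
    "matrixmultiply/threading",
]
```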
How does this impact tiny networks? Could the overhead of threading decrease performance? By tiny networks I mean something on the order of 3 feedforward layers with 32 weights each.
@ViliamVadocz good question, can you or @kstavro look into that?
@coreylowman @ViliamVadocz Can you point me to some meaningful low-dimensional dataset (or an example from this repo) to try out? I could make my own random dataset, but that would have to wait until the weekend. My level of Rust would require some googling even for creating a random set like the above, whereas in Python it would have taken me 5 minutes or so.
Yeah, you can just create random tensors and operate on those:

```rust
let x: Tensor<Rank2<64, 3>> = dev.sample_normal();
let y = model(x);
```

There will be some time spent on sampling from the distribution, but that should take the same time whether you're using matrixmultiply threading or not, so it's still safe to compare the two situations.
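To make that concrete, here is a minimal sketch of such a timing comparison. It is only an illustration, assuming roughly the dfdx 0.11-era API (`Cpu`, `build_module`, `sample_normal`, `Module::forward`) and a hypothetical three-layer, 32-unit network matching the "tiny network" question; run it once with `MATMUL_NUM_THREADS=1` and once with `MATMUL_NUM_THREADS=4` and compare the elapsed times:

```rust
use std::time::Instant;

use dfdx::prelude::*;

// Hypothetical tiny network: 3 feedforward layers with 32 units each.
type Tiny = (
    (Linear<3, 32>, ReLU),
    (Linear<32, 32>, ReLU),
    Linear<32, 2>,
);

fn main() {
    let dev: Cpu = Default::default();
    let model = dev.build_module::<Tiny, f32>();

    // Pre-sample the input once so sampling time is excluded from the loop.
    let x: Tensor<Rank2<64, 3>, f32, _> = dev.sample_normal();

    let start = Instant::now();
    for _ in 0..10_000 {
        let _y = model.forward(x.clone());
    }
    println!("10_000 forward passes took {:?}", start.elapsed());
}
```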
Ok, so I tried to have the batch tensors already generated before the training, so that we only measure the training time, but the tensors don't implement the Copy trait, so I couldn't get that to work. This is the setup I tried with forward and backward.

Performance is pretty much identical, but since I couldn't get the tensors to be pre-sampled, I measured their impact on each epoch (around 10%).
To me, they look practically identical. @ViliamVadocz I could also try some other combinations of input dimension, batch size, and number of batches. Otherwise, I think we can open the PR.
matrixmultiply already supports 1 to 4 threads, so its `threading` feature could probably be integrated into dfdx fairly easily.

I have already manually cargo-added

```toml
matrixmultiply = { version = "0.3.2", features = ["threading"] }
```

to the .toml file (right after dfdx), but training still looks single-threaded. Needs further investigation.

Edit: I was wrong, just activating threading as above does improve performance. I was right that CPU utilization still looks rather single-threaded, but I guess there is a little more utilization of the remaining threads, which helps overall performance.
For comparison (on the 06-mnist example with `BATCH_SIZE = 1024` on a Ryzen 5800X3D):

```powershell
$Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\
$Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\
```

Edit 2: Batch size plays a big role in the performance gains, which makes sense: the larger the batches, the more of the overall load lies on the matmul, I guess. This was also why I initially thought that threading doesn't have an effect.
With the original `BATCH_SIZE = 32` of the example:

```powershell
$Env:MATMUL_NUM_THREADS=1; cargo run --release --example 06-mnist -- .\tmp\
$Env:MATMUL_NUM_THREADS=4; cargo run --release --example 06-mnist -- .\tmp\
```