Incorporate cuDNN, add conv2d CPU/GPU version (based on Eigen and cuDNN) #388
Conversation
This is excellent, thank you! I think we should probably create a …
Agree. Sure, let me mark this one as WIP. I will re-organize the structure a little bit by creating a separate folder to hold third party files.
Let me know if there are any concerns about the above modifications.
Thanks! Two things:
OK, cool.
For 2., yes, the TensorFlow strategy is fine!
Good job. A CPU version is useful enough to help me prototype. Looking forward to DyNet with convolution operations.
That is awesome, I've been needing that for a while. Do you have an estimate of when you'll be pushing the GPU version?
I can probably push an 80%-complete but usable version by next Wednesday (with GPU enabled).
…rsion based on cuDNN 5.1), interface with an optional bias term
OK, I think most of the engineering effort for this issue has been finished. It would be great if someone could review this commit and leave comments.
Why is the shape of x_1 in conv2d [W H C N]?
I think the initial layout of the input (x_1) should be [C W H N] instead of [W H C N].
Good question. This is actually a design choice (that needs some discussion). Here is the explanation (I'll assume RowMajor in the following discussion, with x_1 the feature maps and x_2 the filters): the above conditions imply that there will be at least one layout transformation, either in the CPU version or the GPU version. Since a layout transformation is usually about equally fast on CPU and GPU, I designed it to happen only in the CPU convolution. The reasons are as follows:
Let me know what you think!
Hi, just to confirm: I think the most important thing here is clarity and consistency with the standard DyNet API, so H should come before W, and it makes sense that C comes after HW. N could come either first or last, as both make sense. So in summary, I think NHWC is the most intuitive option here. Even if the performance is a little bit worse on cuDNN, I think we should make sure that we go with consistency and simplicity of the API, which is one of the core design principles behind DyNet.
OK, then I will assume x_1 and x_2 both come in with the layout NHWC.
@zhisbug Thanks! Also, two more things:
Sorry, for 1. I answered my own question. N is the number of batches. Currently DyNet only supports the number of batches being the last dimension, so we probably need it to be HWCN or CHWN for the input, but this is in ColMajor format.
@neubig To clarify, what I was talking about above assumes RowMajor. So RowMajor NHWC/NCHW is ColMajor CWHN/WHCN, and cuDNN is RowMajor. But I am a bit confused by your comment "so we probably need it to be HWCN or CHWN for the input, but this is in ColMajor format". Could you clarify which one of HWCN and CHWN in ColMajor is preferred? Currently, the implementation assumes the input comes in NCHW in RowMajor, which is WHCN in ColMajor (which is aligned with DyNet's design that N comes as the last dim).
OK, just to confirm, the canonical layout in DyNet would be:
Re-written in TensorFlow terminology, maybe this is the following (does W correspond to rows? Because it's "width", I think it should correspond to the number of columns... Regardless, it doesn't matter, as the behavior of rows and columns is identical anyway.)
If we assume that this is written in the opposite order of the actual memory layout (according to the cuDNN convention):
So actually, I think things are OK. We will lay things out in the canonical order of "rows X columns X channels X filters" or "rows X columns X channels X minibatches" in DyNet Tensors, and then the memory order will correspond to the order that cuDNN expects. On CPU, this might mean that we have to do one transform, but I think that's OK.
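To make the RowMajor/ColMajor correspondence above concrete, here is a small standalone sketch (illustrative only, not DyNet code) checking that a column-major H x W x C x N tensor and a row-major N x C x W x H tensor of the same sizes map corresponding indices to the same linear offset:

```cpp
#include <cassert>

// Illustrative check: reversing the dimension order turns a column-major
// layout into the equivalent row-major layout, so an HxWxCxN ColMajor buffer
// can be handed to a RowMajor NxCxWxH consumer (e.g. cuDNN) without copying.
int main() {
  const int H = 4, W = 5, C = 3, N = 2;
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int w = 0; w < W; ++w)
        for (int h = 0; h < H; ++h) {
          // Column-major over (H, W, C, N): the first index varies fastest.
          int col_major = h + H * (w + W * (c + C * n));
          // Row-major over (N, C, W, H): the last index varies fastest.
          int row_major = ((n * C + c) * W + w) * H + h;
          assert(col_major == row_major);
        }
  return 0;
}
```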
@neubig Thanks! This makes sense.
Let me do a final check on whether my assumed input layout is actually the case. If it is not, I will make a minor modification before merging.
Done. Now it assumes the input follows the HxWxCxN layout in ColMajor (equivalently, NxCxWxH in RowMajor), where
Thanks! Basically looks good, but there are a few places where we can remove the custom-implemented code and just rely on Eigen tensor operations, which will make things much simpler and cleaner. Also, please make sure that ordering of function arguments always follows the canonical column-major ordering, so rows come first, then columns, etc. I probably won't have time to test this before merging, so perhaps we can ask the DyNet user base to stress test it.
dynet/gpu-ops.h
@@ -34,6 +34,7 @@ void sparse_to_dense_assign(int n, const unsigned int* ids, float* src, float* trg);
 void dense_to_sparse_subtract(int n, const unsigned int* ids, float* src, float* trg);
 void sparse_to_dense_block_assign_and_multiply(int n, const unsigned *idx, int bsize, float mult, float *src, float *trg);
 void dense_to_sparse_block_add(int n, const unsigned* ids, int bsize, float* src, float* trg);
+void pad_input(float* output, const float* input, int N, int C, int H, int W, int pad_right, int pad_bottom);
I think there may be an Eigen tensor operation "tensor.pad()" for this. Can you try to use this instead of writing a custom kernel? You can find an example in eigen/unsupported/tests/cxx11_tensor_padding.cpp.
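For reference, a minimal CPU-side sketch of what an Eigen pad() call looks like (sizes and padding amounts here are illustrative, not taken from the PR):

```cpp
#include <utility>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // Example 4-D tensor, e.g. H x W x C x N.
  Eigen::Tensor<float, 4> x(5, 5, 3, 2);
  x.setRandom();

  // One (before, after) padding pair per dimension: pad one row at the
  // bottom and one column at the right, leave channels and batch alone.
  Eigen::array<std::pair<int, int>, 4> paddings;
  paddings[0] = std::make_pair(0, 1);
  paddings[1] = std::make_pair(0, 1);
  paddings[2] = std::make_pair(0, 0);
  paddings[3] = std::make_pair(0, 0);

  // pad() zero-pads by default and returns an expression; assigning it to a
  // Tensor evaluates it.
  Eigen::Tensor<float, 4> padded = x.pad(paddings);
  return 0;
}
```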
dynet/op-helper.h
};

// template?
struct NCWHToNWHC {
Similarly, there is an Eigen operation for this. See eigen/unsupported/tests/cxx11_tensor_shuffling.cpp
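For reference, a minimal sketch (illustrative shapes, not the actual DyNet tensors) of replacing a hand-written NCWH-to-NWHC transpose with Eigen's shuffle():

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // Input laid out as N x C x W x H (sizes chosen only for illustration).
  Eigen::Tensor<float, 4> x(2, 3, 5, 7);
  x.setRandom();

  // shuffle(): output dimension i takes input dimension order[i],
  // so {0, 2, 3, 1} maps N x C x W x H to N x W x H x C.
  Eigen::array<int, 4> order{{0, 2, 3, 1}};
  Eigen::Tensor<float, 4> y = x.shuffle(order);
  return 0;
}
```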
I replaced these layout transformation functions with Eigen functions.
…ames, and fix a small bug
dynet/cudnn-ops.cc
//Eigen::array<int, 4> offsets = {0, 0, 0, 0};
//Eigen::array<int, 4> extents = {static_cast<int>(XH), static_cast<int>(XW), static_cast<int>(XC), static_cast<int>(XN)};
//TODO this function cannot be linked
//dxi->tb<3>().device(*dev.edevice) = padded_dx.tb<3>().slice(offsets, extents);
Here I tried to use the Eigen slice() function to do negative padding in place of my custom functions, but ended up with a linking error. Can someone familiar with Eigen GPU operations take a look?
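For context, this is what the slice() call is trying to do; a minimal CPU-only sketch (XH/XW/XC/XN are illustrative sizes here, not the actual variables) that crops a padded tensor back to the original extent:

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  const int XH = 4, XW = 4, XC = 3, XN = 2;

  // A tensor padded by one extra row and column, e.g. the gradient of a
  // padded input.
  Eigen::Tensor<float, 4> padded_dx(XH + 1, XW + 1, XC, XN);
  padded_dx.setRandom();

  // slice(offsets, extents): keep the top-left XH x XW region of every
  // channel/batch slice, i.e. undo the positive padding.
  Eigen::array<int, 4> offsets = {0, 0, 0, 0};
  Eigen::array<int, 4> extents = {XH, XW, XC, XN};
  Eigen::Tensor<float, 4> dx = padded_dx.slice(offsets, extents);
  return 0;
}
```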
dynet/cudnn-ops.cc
//paddings[2] = std::make_pair(0, 0);
//paddings[3] = std::make_pair(0, 0);
//TODO this function cannot be linked
//padded_x.tb<3>().device(*dev.edevice) = xs[0]->tb<3>().pad(paddings);
Here I tried to use the Eigen pad() function to do positive padding on GPU tensors, but ended up with a linking error. Can someone familiar with Eigen GPU operations take a look?
@zhisbug I think this is probably because you're compiling the …
Cool! Now it works!
OK, great! I think the final steps are now removing the unnecessary code and documenting the functions. I think we should probably also deprecate (comment out) the existing conv1d functions, as they don't match the expected behavior from any paper that I'm aware of. See: #236
@neubig Do you think conv1d is still necessary? If it is, maybe I can take a look and reimplement it if needed.
@zhisbug I don't think they're necessary, so for this commit let's just comment them out.
@zhisbug Is this basically done except for doc? If so, let's do the doc and merge it in! (I'll be able to do this tomorrow if you're busy today)
@neubig Yes, this is a complete version except for the docs. I'll create the docs tonight and let you know when they are done and mergeable!
Can this be merged? I will add the Python documentation directly on master.
@pmichel31415 Thanks, done!
@neubig @pmichel31415 Cool! Thanks!
Oh, by the way, @neubig @pmichel31415, the installation guide should probably be updated to make users aware of the cuDNN support/requirement.
After merging, the build does not pass.
#229 This is the CPU implementation based on Eigen SpatialConvolution. It is reported to be the fastest currently available CPU implementation of conv2d.
For GPU support, I think implementing a new version by hand using cuBLAS kernels is not worthwhile, so I am currently incorporating cuDNN into DyNet and will provide a cuDNN-based (standard) implementation.
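For readers unfamiliar with the operation being added, here is a naive single-channel reference for "valid" 2-D convolution (really cross-correlation, as in most toolkits). It is only meant to show what conv2d computes; it is not the Eigen SpatialConvolution or cuDNN path used in this PR:

```cpp
#include <vector>

// Naive reference: one input image, one filter, no padding, stride 1.
// Row-major indexing is used purely for readability.
std::vector<float> conv2d_valid(const std::vector<float>& x, int H, int W,
                                const std::vector<float>& f, int FH, int FW) {
  const int OH = H - FH + 1;
  const int OW = W - FW + 1;
  std::vector<float> y(OH * OW, 0.f);
  for (int i = 0; i < OH; ++i)
    for (int j = 0; j < OW; ++j)
      for (int a = 0; a < FH; ++a)
        for (int b = 0; b < FW; ++b)
          y[i * OW + j] += x[(i + a) * W + (j + b)] * f[a * FW + b];
  return y;
}
```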