add Convolution component in nnet2 #47

Merged — 7 commits merged into kaldi-asr:master on Aug 16, 2015
Conversation

naxingyu (Contributor) commented Aug 3, 2015

Add Convolution component in nnet2, including component test code. The propagation and backprop code is imported from nnet1. A preliminary test on egs/hkust shows reasonable results in the model-averaging parallel training framework.

danpovey (Contributor) commented Aug 3, 2015

Thanks,
Some issues:

  • DerivNeedsInput() should return false.
  • This component does not have anywhere near enough documentation. It's
    not at all clear what the patch stuff is all about, or what the meanings of
    the various strides, etc., are.
  • I would probably want there to be an example script for this before
    considering checking it in.

Dan


naxingyu (Contributor, Author) commented Aug 3, 2015

OK. I'll write some comments for documentation, making the terms clear. I do have an example script for training a ConvNet, including Convolution and Maxpooling. The Maxpooling component was meant to be committed in another PR. I could use one PR for Convolution, Maxpooling, and an example script instead of doing these one step at a time. What's your suggestion?

danpovey (Contributor) commented Aug 3, 2015

Let's eventually do it in one pull request, but you can keep updating this
one so I can check your documentation.
Dan


naxingyu (Contributor, Author) commented Aug 3, 2015

BackpropNeedsInput() now returns false. The Maxpooling component is committed, both with documentation and component test code. Example scripts are provided.

danpovey (Contributor) commented Aug 4, 2015

Sorry I haven't gotten to this yet; I'll try to review it in the next few hours.

danpovey (Contributor) commented Aug 6, 2015

[edit: a better name than ConvolutionalComponent1d would be Convolutional1dComponent.]

naxingyu (Contributor, Author) commented Aug 7, 2015

Hi Dan, thanks for checking.

  1. Documentation and usage in the example scripts are updated. The comments in the shell scripts explain the ConvNet. ReLU is used instead of Sigmoid for the activation after the affine component, and the result is updated;
  2. Documentation in the Convolutional1d (renamed) and Maxpooling components is updated. The documentation before each class declaration explains the functionality of the component, and the convolutional propagation is explained in detail before the member-function definition. Sorry for the long text; I wanted to keep it brief and accurate, but failed...
  3. I tested the speed during training. On average, one job of ConvNet training took 58.4 seconds, while one job for a ReLU network of similar size took 40.3 seconds. In the Convolutional1d component, the inner loops are mostly used for matrix/vector copies. In the Maxpooling component, yes, they don't seem to be well optimized for GPU.

danpovey (Contributor) commented Aug 7, 2015

[Yenda and Karel, the part that relates to you is at the end.]

Great -- thanks a lot for this!
In order to make the convolutional code efficient, we may have to go back
and forth a few more times -- I hope you don't mind (and that you aren't on
some kind of deadline for finishing this up).
Some questions:

  • Did switching from sigmoid to ReLU change the results?
  • Do you have any nnet baseline for the convnet results? I don't see
    anything in the RESULTS file -- it would be nice to have that. Possibly the
    RESULTS file was not up to date.

Regarding efficiency:
If you run the program with gdb --args (program) (args), run with "r",
and ctrl-c into it, then look at the backtrace ("bt") and "continue" a bunch of
times (you may have to restart the program if it dies after the first time),
you can get a feeling for which parts of the code it is spending a lot of
time in.

I noticed some inefficiency in the Convolutional1dComponent::Propagate
function. Obviously this comes from the original nnet1 code, but there is
no harm in fixing it here too; there's no hurry (though ideally it would be
best to test the fixes).
The vectorized_feature_patches_ is a vector of "num_patches" Matrices, and
they are all of the same dimension. You could very easily just make it a
single Matrix with "num_patches" times as many columns, and access the
pieces when needed using a SubMatrix (the ColRange function would be
helpful). Then instead of "num_patches" calls to CopyCols, you can have
just one.
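
A rough sketch of what that restructuring might look like (a sketch only: the
names num_frames, filter_dim, num_patches and column_map stand in for the
component's actual members, and the exact CopyCols signature -- std::vector
vs. CuArray of column indices -- depends on the Kaldi version):

// Propagate, sketched: one big patch matrix, filled by a single CopyCols call.
int32 num_frames = in.NumRows();
CuMatrix<BaseFloat> patches(num_frames, filter_dim * num_patches, kUndefined);
patches.CopyCols(in, column_map);  // column_map has filter_dim * num_patches entries
for (int32 p = 0; p < num_patches; p++) {
  // View patch p as a sub-matrix of "patches"; ColRange makes no copy.
  CuSubMatrix<BaseFloat> patch(patches.ColRange(p * filter_dim, filter_dim));
  // ... multiply "patch" by the filter parameters as before ...
}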

Now let's consider the backprop code. Currently the code is quite
inefficient, with two loops. Imagine you have structured the code in the
way described in the above paragraph, so the "forward" code consists of one
CopyCols call, something like patches.CopyCols(*in, vector_of_column_indexes);.
Ideally we'd like the backprop code to consist of a similar call, like
in->CopyCols(patches, vector_of_column_indexes);.
There is a problem with this. Firstly, we'd have to invert the
column-indexes map, because in this call it's indexed by the column of "in".
And because this is not a one-to-one map, it's not invertible; each column
of "in" may correspond to several columns of "patches". We could try to
fix this by making several of these vector_of_column_indexes vectors (a
number equal to the maximum number of times any given column of "in"
appears in the patches; pad with -1s where needed to fill holes).
However, of course we need a suitable AddCols function instead of CopyCols
[you could use CopyCols the first time, and AddCols thereafter]. The CUDA
programming is very simple, as it just involves copying all the CopyCols
code with a very tiny modification. Organized this way, the number of
kernel invocations will be minimized; it will equal the maximum number of
times the patches overlap. When coding this, you could use three
variables:

// column_map is indexed by the column-index of "patches", and the value is
// the corresponding column-index of "in".
std::vector<int32> column_map;  // this used to be called column_mask, but that is a bad name.

// reversed_column_map is indexed by the column-index c of "in", and it
// contains the list of column-indexes d of "patches" such that column_map[d] == c.
std::vector<std::vector<int32> > reversed_column_map;

// rearranged_column_map is a list of vectors. Each vector in the list is
// indexed by the column-index c of "in", and contains either some
// column-index d of "patches" such that column_map[d] == c, or -1. The
// number of vectors equals the maximum size of any of the lists in
// "reversed_column_map".
std::vector<std::vector<int32> > rearranged_column_map;

In the above, I use the word "list" to refer to a std::vector of elements
in no particular order, and "vector" to refer to a std::vector where the
index has a particular meaning.
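
To make the three maps concrete, here is a small worked example (illustrative
values only, not taken from the actual component):

// Suppose "in" has 3 columns and "patches" has 4 columns, with
//   column_map            = {0, 1, 1, 2};        // patches column d is copied from in column column_map[d]
// Then
//   reversed_column_map   = {{0}, {1, 2}, {3}};  // in column 1 feeds patches columns 1 and 2
// The longest list has length 2, so backprop needs two kernel calls:
//   rearranged_column_map = {{0, 1, 3},          // first call (CopyCols)
//                            {-1, 2, -1}};       // second call (AddCols); -1 means "no source column"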

For the core of the Propagate and Backprop functions, rather than having a
loop of matrix multiplies it would be much better to use the batched
matrix-multiply feature of cuBLAS:
http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemmbatched
However, this is only supported in the "new" CUDA API, cublas_v2.h
(http://docs.nvidia.com/cuda/cublas/#axzz3iABfeGBt). For our current
purposes, let's forget about this.

However, the v2 API has been supported since v4.0 of the CUDA toolkit (i.e.
for quite a while; we are now on 7), and it definitely makes sense to
upgrade the Kaldi project to use the new CUDA API, so that in time we can
make use of this functionality. @jtrmal, could you please create a new
branch in the kaldi-asr repo called cuda-v2, as a copy of trunk?
@vesis84, it would be good if you could help with this, as you wrote the
original cuda matrix library.
The new CUDA API supports having multiple execution contexts. The current
CuDevice class would be very limiting. @vesis84, do you have time to think
about the best way to rework things and come up with some kind of sketch of
a replacement? Or if you decide that leaving it mostly unchanged is the
best way, let us know why.

Dan


naxingyu (Contributor, Author) commented:

Sorry for the pause. Sure, I don't mind; I believe this back-and-forth is necessary for a project like Kaldi.

  • Switching from Sigmoid to ReLU changed the WER from 40.87% to 40.73%.
  • The hkust example is quite out of date. I have updated the RESULTS (and scripts) for SGMM, Nnet1 and Nnet2 networks in my local repo. I could open another PR to update the hkust example, if necessary?
  • I'll use gdb to get detailed timing info.
  • Yes, the Convolutional1d code came from the nnet1 code; I copied the whole component and adapted it to the nnet2 framework. For the Propagate and Backprop inefficiency, your suggestion sounds plausible, and I'll try it with sufficient testing. Thank you very much for the design.
  • I was trying to follow the cublas_v2 issue. So I'll still use a loop of matrix multiplies in this branch, and keep it up to date when the header is updated.

danpovey (Contributor) commented:

Thanks.
Yes, updating the HKUST recipe as a separate PR sounds like a good idea.
Dan


naxingyu (Contributor, Author) commented:

Hi Dan, I updated the code as you suggested, and it works as expected.

  • The patches are used in both Propagate and Update, so I added a mutable member to avoid creating the patches twice.
  • The AddCols code is done, for both GPU and CPU.
  • Both the cu-matrix test and the nnet-component test pass.
  • Sanity: the hkust experiment shows the same results as the previous implementation.
  • Speed: one job of ConvNet training now takes 47.1 seconds, compared with the previous timing of 58.4 seconds. I used gdb to pause and backtrace 20 times at runtime: 9 times it was in the convolutional component, 7 times in the affine component, and 4 times in the maxpooling component.

HKUST recipe will be updated independently in another PR.

danpovey (Contributor) commented:

We're making progress. A few issues remain:

  • Please don't use a mutable member. This code is supposed to work in multi-threaded updates, and the const needs to be enforced. You'll have to use a temporary. You can initialize it as
    CuMatrix<BaseFloat> patches(num_frames, filter_dim * num_patches, kUndefined);
    since you'll be copying to all columns of it.
  • You can use reserve() on the std::vector of indexes to improve the efficiency of setting it up.
  • After addressing the issues described here, please provide more detail in your gdb statistics on where exactly in the convolutional component it stopped (e.g. which lines in the forward or backward?).
  • The way you use num_patches separate calls to in_deriv->AddCols() is inefficient. Some patches don't overlap with each other, so you could handle them in the same call. Instead of looping this way, I recommend deriving the list of reverse maps directly from the forward map, which you can do with the following private static member functions (a sketch of possible implementations follows the declarations):

    /*
      This function does an operation similar to reversing a map, except it
      handles maps that are not one-to-one by outputting the reversed map as a
      vector of lists.
      @param [in] forward_indexes  A vector of int32, each of whose elements is
                  between 0 and input_dim - 1.
      @param [in] input_dim  See definitions of forward_indexes and backward_indexes.
      @param [out] backward_indexes  This function outputs to backward_indexes a
                  vector of dimension input_dim, of lists. The list at
                  backward_indexes[i] is a list of all indexes j such that
                  forward_indexes[j] == i.
    */
    static void ReverseIndexes(const std::vector<int32> &forward_indexes,
                               int32 input_dim,
                               std::vector<std::vector<int32> > *backward_indexes);

    /*
      This function transforms a vector of lists into a list of vectors, padded
      with -1.
      @param [in] in  The input vector of lists. Let in.size() be D, and let the
                  longest list length (i.e. the max of in[i].size()) be L.
      @param [out] out  The output list of vectors. The length of the list will
                  be L, each vector-dimension will be D (i.e. (*out)[k].size() == D),
                  and if in[i] contains j, then for some k we will have that
                  (*out)[k][i] == j. The output vectors are padded with -1 where
                  necessary if not all the input lists have the same size.
    */
    static void RearrangeIndexes(const std::vector<std::vector<int32> > &in,
                                 std::vector<std::vector<int32> > *out);
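
A sketch of how these two helpers might be implemented, following the
description above (illustrative only, not necessarily the code as merged):

void Convolutional1dComponent::ReverseIndexes(
    const std::vector<int32> &forward_indexes,
    int32 input_dim,
    std::vector<std::vector<int32> > *backward_indexes) {
  backward_indexes->clear();
  backward_indexes->resize(input_dim);
  int32 size = forward_indexes.size();
  for (int32 j = 0; j < size; j++) {
    int32 i = forward_indexes[j];
    KALDI_ASSERT(i >= 0 && i < input_dim);
    (*backward_indexes)[i].push_back(j);  // collect all j that map to i
  }
}

void Convolutional1dComponent::RearrangeIndexes(
    const std::vector<std::vector<int32> > &in,
    std::vector<std::vector<int32> > *out) {
  int32 D = in.size(), L = 0;
  for (int32 i = 0; i < D; i++)  // first pass: find the longest list length L
    if (static_cast<int32>(in[i].size()) > L)
      L = in[i].size();
  out->clear();
  out->resize(L, std::vector<int32>(D, -1));  // pre-fill every vector with -1
  for (int32 i = 0; i < D; i++)
    for (int32 k = 0; k < static_cast<int32>(in[i].size()); k++)
      (*out)[k][i] = in[i][k];  // the k'th overlap of in-column i
}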

naxingyu (Contributor, Author) commented:

Dan, I'm having trouble rearranging the reversed indexes. The reversing can be done very quickly by parsing forward_indexes. However, rearranging does not seem doable with a single loop; I used many loops with conditions, and it seems very inefficient. Am I missing something?

danpovey (Contributor) commented:

You'd probably need a loop first to figure out the maximum list length;
then size the outputs correctly and fill them up with -1's; then iterate
over each of the input lists and set each of the output vectors in turn.
Dan


naxingyu (Contributor, Author) commented:

Yes, that's exactly how I did it. The problem is that the third part, the list-to-vector conversion, requires a lot of conditionals to decide which vector each element goes into. The code is ugly. I need to redo it...

danpovey (Contributor) commented:

It shouldn't be complicated. For each list, just put the i'th element in
the i'th vector.
Dan


naxingyu (Contributor, Author) commented:

It works now... There is one thing I don't follow: if BackpropNeedsInput() returns false, the forward data will be deleted, so how are the gradients computed?

danpovey (Contributor) commented:

OK, cool. I'll try to review it again soon and maybe merge.
The backprop for the convolutional component does not require the forward
data because it is a linear operation; it only needs the derivatives from
the next layer.
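(In equation form, with generic notation rather than the component's actual
variables: the layer computes y = W x, so the input derivative is
dL/dx = W^T (dL/dy), which involves only the weights and the derivative from
the next layer, never the stored input x.)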
Dan


naxingyu (Contributor, Author) commented:

I'm still working on it. I'll let you know when it's ready for review.

naxingyu (Contributor, Author) commented:

Index operations are done, and the mutable member is removed.
Detailed timing using gdb (number of hits per location):
(1) affine prop
(7) affine backprop
(2) maxpool prop
(7) maxpool backprop
(5) convolutional prop
(12) convolutional backprop
(2) normalize prop

danpovey (Contributor) commented:

Thanks!
Could you provide some more details for the gdb backtraces while it's in
the maxpool backprop, the convolutional prop and the convolutional
backprop? I.e. what lines of code is it in? The convolutional code has
quite a few different phases, so it's hard to guess this.

Dan


naxingyu (Contributor, Author) commented:

Sure. Here it is.
(1) maxpool->backprop->CuMatrix(): 4257
(2) maxpool->backprop->MulElements(): 4261
(3) convolution->prop->AddMatMat(): 3899
(3) convolution->backprop->AddMatMat(): 3999
(3) convolution->backprop->update->AddMatMat(): 4166
(3) convolution->backprop->ReverseIndexes(): 4013

danpovey (Contributor) commented:

I'm having trouble interpreting these line numbers. For instance, here
https://github.com/kaldi-asr/kaldi/pull/47/files
ReverseIndexes starts at line 3930, but you have a line number of 4166 for
it. Maybe your pull request is out of date?
Dan

On Sat, Aug 15, 2015 at 7:38 AM, Xingyu Na notifications@github.com wrote:

Sure. Here it is.
(1) maxpool->backprop->CuMatrix(): 4257
(2) maxpool->backprop->MulElements(): 4261
(3) convolution->prop->AddMatMat(): 3899
(3) convolution->backprop->AddMatMat(): 3999
(3) convolution->backprop->update->AddMatMat(): 4013
(3) convolution->backprop->ReverseIndexes(): 4166



naxingyu (Contributor, Author) commented:

The line numbers indicate where each call is made. I posted them because they give a hint about which step of the prop and backprop the program was stopped at.

danpovey (Contributor) commented:

Yes, but what I'm saying is that those line numbers appear not to be
correct. For instance, line 4166 is the line
filters_grad.AddMatMat(1.0, diff_patch, kTrans, patch, kNoTrans, 1.0);
(see here https://github.com/kaldi-asr/kaldi/pull/47/files)
which has nothing to do with ReverseIndexes. Dan


danpovey (Contributor) commented:

OK, it could be that you just reversed two of the line numbers.
It's disappointing if time is being taken in ReverseIndexes(); that should
be a fast operation.
I suggest having a temporary variable

int32 size = forward_indexes.size();

which will prevent the compiler from having to re-compute the size each
time (sometimes compilers assume things could be being changed by another
thread, so they won't assume something is constant).
It may also help to estimate a rough upper bound on how big each of the
backward lists will have to be (e.g. 2 + forward_indexes.size() /
input_dim), and iterate through all the backward lists and reserve() them
to that size before you do the main loop. Learn how to use C++ iterators
for this, e.g.:

std::vector<std::vector<int32> >::iterator iter = backward_lists.begin(),
    end = backward_lists.end();
for (; iter != end; ++iter)
  iter->reserve(reserve_size);

Dan


naxingyu (Contributor, Author) commented:

Thanks. It's done. I ran gdb again and ReverseIndexes() didn't appear any more; instead, more AffineComponent::Backprop showed up.

danpovey (Contributor) commented:

Cool -- I think we're done! Merging.
Please don't forget the separate pull request for the baselines in the HKUST recipe. Maybe it will be easier once this one is done.
Dan

danpovey added a commit that referenced this pull request on Aug 16, 2015: add Convolution component in nnet2
danpovey merged commit 5526c21 into kaldi-asr:master on Aug 16, 2015
danpovey (Contributor) commented:

BTW, if you have time at some point, it would be a good thing to take some of these efficiency improvements and apply them to the nnet1 1d-convolutional code.

naxingyu (Contributor, Author) commented:

Cool :) Yes that's exactly what I planned to do.

KarelVesely84 (Contributor) commented:

Hi, many thanks for the enhancement! (getting back to work after two weeks off)
