[src] CUDA Online/Offline pipelines + light batched nnet3 driver #3568
Conversation
Force-pushed from 8cb7666 to 9417763
In cudadecoder/batched-static-nnet3.cc, we noticed there is no right-context padding at the utterance end. There used to be right_context frames of padding at the end of the utterance. Is this intended? We are seeing some words dropped at the end of utterances.
int input_frames_per_chunk_;
int output_frames_per_chunk_;
BaseFloat seconds_per_chunk_;
BaseFloat samples_per_chunk_;
Shall we use int for the variable samples_per_chunk?
I'll change that. Thanks
@pingpiang2019 The offline wrapper will take care of flushing the right context at the end. If you use the online pipeline directly, then for now the best way is to send an extra chunk containing silence to flush the right context. It will be fixed at some point.
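For reference, a minimal sketch of the silence-flush workaround described above. The pipeline type is left as a template parameter, and the DecodeBatch(corr_ids, chunks, is_last_chunk) shape is an assumption based on this thread, not the final API.

```cpp
// Hedged sketch: flush the decoder's right context in pure online mode by
// sending one extra all-zero ("silence") chunk, marked as the last chunk of
// the utterance. PipelineT stands for the online pipeline class of this PR;
// the DecodeBatch signature used here is an assumption, for illustration only.
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void FlushWithSilenceChunk(PipelineT &pipeline, std::uint64_t corr_id,
                           int samples_per_chunk) {
  kaldi::Vector<kaldi::BaseFloat> silence(samples_per_chunk);  // zero-initialized samples
  std::vector<std::uint64_t> corr_ids = {corr_id};
  std::vector<kaldi::SubVector<kaldi::BaseFloat>> chunks = {
      kaldi::SubVector<kaldi::BaseFloat>(silence, 0, silence.Dim())};
  std::vector<bool> is_last_chunk = {true};
  pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
}
```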
@hugovbraun Hi, any update on this? Thank you very much anyway!
@ppamorim Yes, I've resumed work on this. Currently in the process of getting it ready for merge and more thorough testing.
Does this mean we can't opt out of gpu_feature_extract anymore? Does the GPU feature extraction support pitch? At some point it didn't, according to:
@hugovbraun Hi, because there is no pitch CUDA code, in the current code we can set --gpu-feature-extract=false to compute features on the CPU. But in your new code this option is removed. How can we still compute features on the CPU so that pitch is supported? Thanks a lot!
Ok, that's an issue. Is pitch the only thing missing? We may need to add (back) the option for CPU feature extraction in the offline pipeline.
@hugovbraun Thanks a lot for the quick reply. So far pitch is the only thing we have found missing, but other feature extractions such as PLP and fbank are also not included in the GPU version, so CPU extraction is important at the current stage. Another option is to use the CPU to extract pitch while still using the GPU to extract MFCCs. Reading the code, it seems the offline pipeline and the online pipeline are the same in your new design, so if CPU feature extraction is added to the offline pipeline, the online one would be supported as well, right? Thanks!
Ok, thanks for the info.
Force-pushed from 9417763 to cea3ebb
OnlineNnet2FeaturePipeline feature(*feature_info_);
// TODO clean following lines
input_dim_ = feature.InputFeature()->Dim();
ivector_dim_ = feature.IvectorFeature()->Dim();
Does the current code require an MFCC+ivector model? I tried running batched-wav-nnet3-cuda with a model on hand that uses fbank without ivectors, and it segfaults at this line.
My fbank-based model works fine (using the batched-wav-nnet3-cuda executable).
Sorry, my executable was from the offline batch. My fbank model is not working either.
Yes, it's not tested yet without ivectors
Sorry, my executable was from the offline batch. That comment was wrong.
Sorry, wrong branch.
Any hints on how to pinpoint the problem? This is on CentOS 7, with NVIDIA driver 440.33.01 and CUDA 10.2.89.
Force-pushed from adea2f6 to 19241b8
@twisteroidambassador What kind of features are you using? Any chance you are using a model without ivectors?
@twisteroidambassador Just fixed a bug. However, the case "model with MFCC but without ivectors" is still untested, so there may be other issues.
@hugovbraun I was in fact using an fbank model without i-vectors, so I had to transplant the original spectral feature code supporting fbank into the PR, and add a bunch of checks. I see that master now has new GPU online feature extraction natively supporting fbank. I'll try the latest version of this PR, see whether any more i-vector checks are necessary, and report back.
Attached is a patch that allows all three executables under […]: kaldi-gpu-online-ivector.diff.txt. However, the recognition result of […]
@twisteroidambassador by old executables do you mean cudadecoderbin/batched-wav-nnet3-cuda and cudadecoderbin/batched-wav-nnet3-cuda2, or just the version without the "2"?
@hugovbraun Both.
@twisteroidambassador Ok. batched-wav-nnet3-cuda2 calls the online pipeline behind the scenes, with the exception of the feature extraction if --use-online-features is set to false (the default). Could you try running batched-wav-nnet3-cuda2 with --use-online-features=true to try to isolate the bug?
@twisteroidambassador I can repro the problem locally with an fbank model and --use-online-features=true. Looks like we have a bug in our online fbank code. We'll take a look.
@hugovbraun Yes, I can confirm that adding […]
Force-pushed from e472e3a to beb1c4b
I was just thinking, what will happen if a correlation ID is repeated in a batch? I.e., if a batch is [(corrID 1, chunk 0-50), (corrID 1, chunk 50-100), (corrID 1, chunk 100-150)], does it still work correctly, error out, or do something unexpected?
Also, is there an expected accuracy drop with the new online pipeline? I compared the recognition results between […]
@twisteroidambassador You need to use at most one chunk per corr_id per batch. Looks like I forgot to put a comment about this. We'll add an assert at some point.
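For illustration, a minimal sketch of feeding several pending chunks of the same correlation ID in consecutive DecodeBatch calls rather than packing them into one batch. The types and the DecodeBatch shape are assumptions carried over from the earlier sketch.

```cpp
// Hedged sketch: never put two chunks of the same corr_id in one DecodeBatch
// call; feed them in consecutive calls instead. PipelineT and the DecodeBatch
// signature are illustrative assumptions, not the actual API.
#include <cstddef>
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void FeedChunksSequentially(
    PipelineT &pipeline, std::uint64_t corr_id,
    const std::vector<kaldi::SubVector<kaldi::BaseFloat>> &pending_chunks) {
  for (std::size_t i = 0; i < pending_chunks.size(); ++i) {
    std::vector<std::uint64_t> corr_ids = {corr_id};
    std::vector<kaldi::SubVector<kaldi::BaseFloat>> chunks = {pending_chunks[i]};
    std::vector<bool> is_last_chunk = {i + 1 == pending_chunks.size()};
    // One chunk per corr_id per batch; other utterances could be batched
    // alongside, but not another chunk of this same corr_id.
    pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
  }
}
```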
@hugovbraun I have been using the exact same config options for the GPU and CPU decoders, including setting a low-ish --max-active. Does the GPU decoder need a higher --max-active value to achieve the same accuracy as the CPU decoder?
I think some documentation on how to tune the GPU decoder would be good, if it doesn't already exist.
@hugovbraun I have a question about receiving partial results. My use case is pretty common: listening to audio channels, sending partial results back often, and showing the final results afterwards. Like a multi-channel online2-tcp-nnet3-decode-faster.
With some more testing, I determined that with my current model and config (specifically the beam, lattice-beam and max-active options), the limiting factor is beam. After loosening the constraints gradually until the one-best result no longer changes, for some utterances the result from the GPU decoder is still not the same as that from the CPU decoder. So, I guess my question is: should I expect the GPU and CPU decoders to produce identical output for the same utterance, either when constrained to the same pruning options such as beam, or when unconstrained?
No, they are not identical. Some slightly different design decisions were made in the GPU decoder.
Is there any documentation, discussion, mail archive, etc. where I can learn about the algorithmic and design differences between the GPU and CPU decoders? I'll also try reading the code, but a higher-level description would be very welcome.
@twisteroidambassador @danpovey You are right, we need to write an up-to-date "how to use" guide for the decoder itself.
The final results are not exactly the same as the CPU ones (due to the reasons listed above), but they are described as "virtually identical" by our partners. @al-zatv, implementing an efficient way to get back partial results after each output frame is next on our list (along with endpointing, which is a similar change). As of today there's no real way to get back partial results (you could rely on the "normal" GetRawLattice, but it will be slow).
@hugovbraun I reran the tests to be sure. Yes, starting from my "stock" settings […]. Given that, it's strange that at stock settings the one-best CPU vs. GPU output is quite different, over ~50 utterances. Looking at the generated lattice files: with --determinize-lattice=false the GPU ark is ~5MB while the CPU ark is ~50MB, and with --determinize-lattice=true the GPU ark is ~700KB while the CPU ark is ~1.4MB. (These are binary ark files.) As for --main-q-capacity and --aux-q-capacity, I have not seen any related warning messages printed to the console, so I don't think those limits were hit.
How does the pipeline handle acoustic / language model scaling? When the […]
And a related question. Since […] (I was looking through nnet3-latgen-faster-batch.cc for scaling operations, to see whether that explains why the results are different from the GPU pipeline. I found several places where it seems to scale inversely w.r.t. acoustic_scale before output, but couldn't find where it scales proportionally to acoustic_scale. It was rather confusing.)
@twisteroidambassador The acoustic scale is applied there: https://github.com/hugovbraun/kaldi/blob/0b865f664e4e2e71bae0ecb5c94436047937ea09/src/cudadecoder/batched-static-nnet3.cc#L288
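For readers following along, a minimal sketch of what applying the acoustic scale at that point amounts to: the nnet3 output log-likelihoods are multiplied by acoustic_scale before being handed to the decoder. Function and variable names here are illustrative, not the actual code in batched-static-nnet3.cc.

```cpp
// Hedged sketch: scale the nnet3 output log-likelihoods by acoustic_scale on
// device before decoding. Names are illustrative only.
#include "cudamatrix/cu-matrix.h"

void ApplyAcousticScale(kaldi::CuMatrix<kaldi::BaseFloat> *nnet3_output_loglikes,
                        kaldi::BaseFloat acoustic_scale) {
  if (acoustic_scale != 1.0) nnet3_output_loglikes->Scale(acoustic_scale);
}
```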
Did not realize this was ready. Merging.
Just found out that, when calling DecodeBatch(wave_samples), if one of the SubVectors in wave_samples has length 0, the next call to RunNnet3 will freeze.
I think you're right, sending a chunk of length 0 should be valid and allow you to end an utterance. We'll make the change. Thanks for reporting that. For future issues, it may be better to create a new GitHub issue to simplify tracking.
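Until that change lands, a caller-side guard like the following could avoid the freeze. This is only a sketch; the types and the DecodeBatch shape are assumptions carried over from the earlier sketches, and note that silently dropping an empty final chunk means the utterance is not flushed (see the silence-chunk workaround earlier in this thread).

```cpp
// Hedged sketch: skip zero-length chunks before calling DecodeBatch, since an
// empty SubVector currently freezes the next RunNnet3 call. PipelineT and the
// DecodeBatch signature are illustrative assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void DecodeBatchSkippingEmptyChunks(
    PipelineT &pipeline, const std::vector<std::uint64_t> &corr_ids,
    const std::vector<kaldi::SubVector<kaldi::BaseFloat>> &chunks,
    const std::vector<bool> &is_last_chunk) {
  std::vector<std::uint64_t> kept_ids;
  std::vector<kaldi::SubVector<kaldi::BaseFloat>> kept_chunks;
  std::vector<bool> kept_last;
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    if (chunks[i].Dim() == 0) continue;  // drop empty chunks for now
    kept_ids.push_back(corr_ids[i]);
    kept_chunks.push_back(chunks[i]);
    kept_last.push_back(is_last_chunk[i]);
  }
  if (!kept_ids.empty()) pipeline.DecodeBatch(kept_ids, kept_chunks, kept_last);
}
```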
Hello, is it possible to modify the decoder into a server that could support multiple requests from different clients?
This is still WIP. It requires some cleaning, integrating the online MFCC into a separate PR (cf. below), and some other things.
This implements a low-latency, high-throughput pipeline designed for online use. It uses the GPU decoder, the GPU MFCC/ivector extraction, and a new lean nnet3 driver (including nnet3 context switching on device).
The online pipeline can be seen as taking a batch as input and then running a very regular algorithm: feature extraction, nnet3, decoder, and postprocessing on that same batch, in a synchronous fashion (i.e. all of those steps run when DecodeBatch is called; nothing is sent to asynchronous pipelines along the way). What happens when you run DecodeBatch is very regular, and because of that the pipeline is able to guarantee some latency constraints (the way the code will be executed is very predictable). It also focuses on being lean, avoiding reallocations or recomputations (such as recompiling nnet3).
The online pipeline takes care of computing [MFCC, iVectors], nnet3, the decoder, and postprocessing. It can either take chunks of raw audio as input (and then compute mfcc->nnet3->decoder->postprocessing), or it can be called directly with MFCC features/ivectors (and then compute nnet3->decoder->postprocessing). The second possibility is used by the offline wrapper when use_online_ivectors=false. A rough sketch of this call flow is shown below.
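A minimal sketch of the per-batch call flow described above. The stage functions are no-op placeholders for illustration, not the actual internal API.

```cpp
// Hedged sketch of the synchronous per-batch flow: every stage runs inside one
// DecodeBatch call, and nothing is handed off to an asynchronous pipeline,
// which is what makes the latency predictable. All names are placeholders.
struct Batch {
  bool has_raw_audio;  // true: raw audio chunks; false: precomputed features/ivectors
  // corr_ids, audio chunks or features, lattices, ...
};

static void ComputeFeatures(Batch &) {}   // GPU MFCC / iVectors
static void RunNnet3(Batch &) {}          // lean batched nnet3 driver
static void AdvanceDecoding(Batch &) {}   // GPU decoder
static void PostProcess(Batch &) {}       // lattice / best-path postprocessing

void DecodeBatchSketch(Batch &batch) {
  // Entry point 1: raw audio -> run feature extraction first.
  // Entry point 2: precomputed MFCC features/ivectors -> skip it
  // (the path used by the offline wrapper when use_online_ivectors=false).
  if (batch.has_raw_audio) ComputeFeatures(batch);
  RunNnet3(batch);
  AdvanceDecoding(batch);
  PostProcess(batch);
}
```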
The old offline pipeline is replaced by a new offline pipeline that is mostly a wrapper around the online pipeline. It provides an offline-friendly API (accepting full utterances as input instead of chunks) and adds the possibility of pre-computing ivectors on the full utterance first (use_online_ivectors=false). It then calls the online pipeline internally to do most of the work.
The easiest way to test the online pipeline end-to-end for now is to call it through the offline wrapper with use_online_ivectors=true. Please note that ivectors are ignored for now in this fully end-to-end online mode (i.e. when use_online_ivectors=true), because the GPU ivectors are not yet ready for online use; the pipeline code itself is ready, however. The offline pipeline with use_online_ivectors=false should be fully functional and returns the same WER as before.
It includes a new light nnet3 driver designed for the GPU. The key idea is that it's usually better to waste some flops computing things such as partial chunks or partial batches. For example, the last chunk of an utterance (say nframes=17) can be smaller than max_chunk_size (50 frames by default). In that case, compiling a new nnet3 computation for that exact chunk size is slower than just running it with a chunk size of 50 and ignoring the invalid output. A small sketch of this idea follows.
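A minimal sketch of the partial-chunk handling just described, with illustrative names only (this is not the actual driver code):

```cpp
// Hedged sketch: pad a partial last chunk up to the fixed chunk size so the
// precompiled nnet3 computation can be reused, and remember how many of the
// output frames are actually valid. Names are illustrative.
#include <algorithm>

struct ChunkPlan {
  int frames_to_compute;  // always the fixed chunk size, e.g. 50
  int valid_frames;       // how many of the computed frames are meaningful
};

ChunkPlan PlanChunk(int frames_in_chunk, int max_chunk_size /* e.g. 50 */) {
  ChunkPlan plan;
  plan.frames_to_compute = max_chunk_size;  // no recompilation for smaller chunks
  plan.valid_frames = std::min(frames_in_chunk, max_chunk_size);  // e.g. 17 for the last chunk
  return plan;
}
```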
The same idea applies to batch_size: the nnet3 computation always runs with a fixed minibatch size, defined as minibatch_size = std::min(max_batch_size, MAX_MINIBATCH_SIZE). MAX_MINIBATCH_SIZE is chosen to be large enough to hide the kernel launch latency and increase the arithmetic intensity of the GEMMs, but not larger, so that partial batches are not slowed down too much (i.e. avoiding running a minibatch of size 512 where only 72 utterances are valid). MAX_MINIBATCH_SIZE is currently 128. We then run nnet3 multiple times on the same batch if necessary: if batch_size=512, we run nnet3 (with minibatch_size=128) four times.
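The minibatching rule reduces to the following arithmetic (illustrative names, not the actual code):

```cpp
// Hedged sketch of the minibatching rule described above.
#include <algorithm>

constexpr int kMaxMinibatchSize = 128;  // MAX_MINIBATCH_SIZE in the description

int NumNnet3Passes(int batch_size, int max_batch_size) {
  int minibatch_size = std::min(max_batch_size, kMaxMinibatchSize);
  // Ceiling division: e.g. batch_size=512, minibatch_size=128 -> 4 nnet3 passes.
  return (batch_size + minibatch_size - 1) / minibatch_size;
}
```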
The context switch (restoring the nnet3 left and right context and the ivector) is done on device. Everything that needs context switching uses the concept of channels, to be consistent with the GPU decoder.
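For illustration, a sketch of the kind of per-channel state such a device-side context switch maintains; the struct and member names are assumptions, not the actual implementation.

```cpp
// Hedged sketch: per-channel buffers kept on device so that the nnet3 context
// and ivector can be restored when a new chunk of that channel arrives.
// Types and members are illustrative only.
#include <vector>
#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"

struct ChannelContext {
  kaldi::CuMatrix<kaldi::BaseFloat> nnet3_context_frames;  // left/right context frames, on device
  kaldi::CuVector<kaldi::BaseFloat> ivector;               // current iVector, on device
};

// One entry per channel; "channel" matches the GPU decoder's notion of channel.
using ChannelContexts = std::vector<ChannelContext>;
```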
Those "lean" approaches gave us better performance, and a drop in memory usage (total GPU memory usage from 15GB to 4GB for librispeech and batch size 500). It also removes the need for "high level" multithreading (i.e. cuda-control-threads).
Some parameters are dropped because the new code design doesn't require them (--cuda-control-threads, the drain size parameter). In theory the configuration should be greatly simplified: only --max-batch-size needs to be set; the others are optional.
The code in cudafeat/ modifies the GPU MFCC code. MFCC features can now be batched and processed online (restoring a few hundred frames of past audio for each new chunk). That code was implemented by @mcdavid109 (thanks!). We'll create a separate PR for it; it requires some cleaning, and a large part of the code is redundant with existing MFCC files.
GPU batched online ivectors and cmvn are WIP.
When used with use_online_ivectors=false, this code reaches 4,940 XRTF on librispeech/test_clean, with a latency of around 6x realtime for max_batch_size=512 (latency would be lower with a smaller max_batch_size).
One use case where only latency matters (and not throughput) is, for instance, the Jetson Nano, where some initial runs were measured at 5-10x realtime latency for a single channel (max_batch_size=1) on librispeech/clean. Those measurements are indicative only; more reliable measurements will be done in the future.