[src] CUDA Online/Offline pipelines + light batched nnet3 driver #3568
Conversation
Force-pushed from 8cb7666 to 9417763
In cudadecoder/batched-static-nnet3.cc, we noticed there is no right-context padding at the utterance end. There used to be right_context frames of padding at the end of the utterance. Is this intended? We are seeing some words dropped at the end of utterances.
int input_frames_per_chunk_;
int output_frames_per_chunk_;
BaseFloat seconds_per_chunk_;
BaseFloat samples_per_chunk_;
Shall we use int for the variable samples_per_chunk?
I'll change that. Thanks
@pingpiang2019 The offline wrapper will take care of flushing the right context at the end. If you use the online pipeline directly, then for now the best way is to send an extra chunk containing silence to flush the right context. It will be fixed at some point.
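For reference, a minimal sketch of the silence-flush workaround described above. The pipeline type is left as a template parameter, and the DecodeBatch(corr_ids, chunks, is_last_chunk) shape is an assumption based on this thread, not the final API.

```cpp
// Hedged sketch: flush the decoder's right context in pure online mode by
// sending one extra all-zero ("silence") chunk, marked as the last chunk of
// the utterance. PipelineT stands for the online pipeline class of this PR;
// the DecodeBatch signature used here is an assumption, for illustration only.
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void FlushWithSilenceChunk(PipelineT &pipeline, std::uint64_t corr_id,
                           int samples_per_chunk) {
  kaldi::Vector<kaldi::BaseFloat> silence(samples_per_chunk);  // zero-initialized samples
  std::vector<std::uint64_t> corr_ids = {corr_id};
  std::vector<kaldi::SubVector<kaldi::BaseFloat>> chunks = {
      kaldi::SubVector<kaldi::BaseFloat>(silence, 0, silence.Dim())};
  std::vector<bool> is_last_chunk = {true};
  pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
}
```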
@hugovbraun Hi, any update on this? Thank you very much anyway!
@ppamorim Yes, I've resumed work on this. Currently in the process of getting it ready for merge and more thorough testing.
Does this mean we can't opt out of gpu_feature_extract anymore? Does the GPU feature extraction support pitch? At some point it didn't, according to:
@hugovbraun Hi, because there is no pitch CUDA code, in the current code we can set --gpu-feature-extract=false to compute features on the CPU. But in your new code this option is removed. How can we still compute features on the CPU so that pitch is supported? Thanks a lot!
Ok, that's an issue. Is pitch the only thing missing? We may need to add (back) the option for CPU feature extraction in the offline pipeline.
@hugovbraun Thanks a lot for the quick reply. So far pitch is the only thing we have found missing, but other feature extractions such as PLP and fbank are also not included in the GPU version, so CPU extraction is important at the current stage. Another option is to use the CPU to extract pitch while still using the GPU to extract MFCCs. Reading the code, it seems the offline pipeline and the online pipeline are the same in your new design, so if CPU feature extraction is added to the offline pipeline, the online one would be supported as well, right? Thanks!
Ok, thanks for the info.
Force-pushed from 9417763 to cea3ebb
OnlineNnet2FeaturePipeline feature(*feature_info_);
// TODO clean following lines
input_dim_ = feature.InputFeature()->Dim();
ivector_dim_ = feature.IvectorFeature()->Dim();
Does the current code require an MFCC+ivector model? I tried running batched-wav-nnet3-cuda with a model on hand that uses fbank without ivectors, and it segfaults at this line.
My fbank-based model works fine (using the batched-wav-nnet3-cuda executable).
Sorry, my executable was from the offline batch. My fbank model is not working either.
Yes, it's not tested yet without ivectors
Sorry, my executable was from the offline batch. That comment was wrong.
Sorry, wrong branch.
Any hints on how to pinpoint the problem? This is on CentOS 7, with NVIDIA driver 440.33.01 and CUDA 10.2.89.
Force-pushed from adea2f6 to 19241b8
@twisteroidambassador What kind of features are you using? Any chance you are using a model without ivectors?
@twisteroidambassador Just fixed a bug. However, the case "model with MFCC but without ivectors" is still untested, so there may be other issues.
@hugovbraun I was in fact using an fbank model without i-vectors, so I had to transplant the original spectral feature code supporting fbank into the PR, and add a bunch of checks. I see that master now has new GPU online feature extraction natively supporting fbank. I'll try the latest version of this PR, see whether any more i-vector checks are necessary, and report back.
Attached is a patch that allows all three executables under […]: kaldi-gpu-online-ivector.diff.txt. However, the recognition result of […]
@twisteroidambassador by old executables do you mean cudadecoderbin/batched-wav-nnet3-cuda and cudadecoderbin/batched-wav-nnet3-cuda2, or just the version without the "2"?
@hugovbraun Both.
@twisteroidambassador Ok. batched-wav-nnet3-cuda2 calls the online pipeline behind the scenes, with the exception of the feature extraction if --use-online-features is set to false (the default). Could you try running batched-wav-nnet3-cuda2 with --use-online-features=true to try to isolate the bug?
@twisteroidambassador I can repro the problem locally with an fbank model and --use-online-features=true. Looks like we have a bug in our online fbank code. We'll take a look.
@hugovbraun Yes, I can confirm that adding […]
Force-pushed from e472e3a to beb1c4b
I was just thinking, what will happen if a correlation ID is repeated in a batch? I.e., if a batch is [(corrID 1, chunk 0-50), (corrID 1, chunk 50-100), (corrID 1, chunk 100-150)], does it still work correctly, error out, or do something unexpected?
Also, is there an expected accuracy drop with the new online pipeline? I compared the recognition results between […]
@twisteroidambassador You need to use at most one chunk per corr_id per batch. Looks like I forgot to put a comment about this. We'll add an assert at some point.
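For illustration, a minimal sketch of feeding several pending chunks of the same correlation ID in consecutive DecodeBatch calls rather than packing them into one batch. The types and the DecodeBatch shape are assumptions carried over from the earlier sketch.

```cpp
// Hedged sketch: never put two chunks of the same corr_id in one DecodeBatch
// call; feed them in consecutive calls instead. PipelineT and the DecodeBatch
// signature are illustrative assumptions, not the actual API.
#include <cstddef>
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void FeedChunksSequentially(
    PipelineT &pipeline, std::uint64_t corr_id,
    const std::vector<kaldi::SubVector<kaldi::BaseFloat>> &pending_chunks) {
  for (std::size_t i = 0; i < pending_chunks.size(); ++i) {
    std::vector<std::uint64_t> corr_ids = {corr_id};
    std::vector<kaldi::SubVector<kaldi::BaseFloat>> chunks = {pending_chunks[i]};
    std::vector<bool> is_last_chunk = {i + 1 == pending_chunks.size()};
    // One chunk per corr_id per batch; other utterances could be batched
    // alongside, but not another chunk of this same corr_id.
    pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
  }
}
```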
@hugovbraun I have been using the exact same config options for the GPU and CPU decoders, including setting a low-ish --max-active. Does the GPU decoder need a higher --max-active value to achieve the same accuracy as the CPU decoder?
I think some documentation on how to tune the GPU decoder would be good, if it doesn't already exist.
@hugovbraun I have a question about receiving partial results. My use case is pretty common: listening to audio channels, sending partial results back often, and showing the final results afterwards. Like a multi-channel online2-tcp-nnet3-decode-faster.
With some more testing, I determined that with my current model and config (specifically the beam, lattice-beam and max-active options), the limiting factor is beam. After loosening the constraints gradually until the one-best result no longer changes, for some utterances the result from the GPU decoder is still not the same as that from the CPU decoder. So, I guess my question is: should I expect the GPU and CPU decoders to produce identical output for the same utterance, either when constrained to the same pruning options such as beam, or when unconstrained?
No, they are not identical. Some slightly different design decisions were made in the GPU decoder.
Is there any documentation, discussion, mail archive, etc. where I can learn about the algorithmic and design differences between the GPU and CPU decoders? I'll also try reading the code, but a higher-level description would be very welcome.
@twisteroidambassador @danpovey You are right, we need to write an up-to-date "how to use" guide for the decoder itself.
The final results are not exactly the same as the CPU ones (due to the reasons listed above), but they are described as "virtually identical" by our partners. @al-zatv, implementing an efficient way to get back partial results after each output frame is next on our list (along with endpointing, which is a similar change). As of today there's no real way to get back partial results (you could rely on the "normal" GetRawLattice, but it will be slow).
@hugovbraun I reran the tests to be sure. Yes, starting from my "stock" settings […]. Given that, it's strange that at stock settings the one-best CPU vs. GPU output is quite different, over ~50 utterances. Looking at the generated lattice files: with --determinize-lattice=false the GPU ark is ~5MB while the CPU ark is ~50MB, and with --determinize-lattice=true the GPU ark is ~700KB while the CPU ark is ~1.4MB. (These are binary ark files.) As for --main-q-capacity and --aux-q-capacity, I have not seen any related warning messages printed to the console, so I don't think those limits were hit.
How does the pipeline handle acoustic / language model scaling? When the […]
And a related question. Since […] (I was looking through nnet3-latgen-faster-batch.cc for scaling operations, to see whether that explains why the results are different from the GPU pipeline. I found several places where it seems to scale inversely w.r.t. acoustic_scale before output, but couldn't find where it scales proportionally to acoustic_scale. It was rather confusing.)
@twisteroidambassador The acoustic scale is applied there: https://github.com/hugovbraun/kaldi/blob/0b865f664e4e2e71bae0ecb5c94436047937ea09/src/cudadecoder/batched-static-nnet3.cc#L288
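For readers following along, a minimal sketch of what applying the acoustic scale at that point amounts to: the nnet3 output log-likelihoods are multiplied by acoustic_scale before being handed to the decoder. Function and variable names here are illustrative, not the actual code in batched-static-nnet3.cc.

```cpp
// Hedged sketch: scale the nnet3 output log-likelihoods by acoustic_scale on
// device before decoding. Names are illustrative only.
#include "cudamatrix/cu-matrix.h"

void ApplyAcousticScale(kaldi::CuMatrix<kaldi::BaseFloat> *nnet3_output_loglikes,
                        kaldi::BaseFloat acoustic_scale) {
  if (acoustic_scale != 1.0) nnet3_output_loglikes->Scale(acoustic_scale);
}
```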
Did not realize this was ready. Merging.
Just found out that, when calling DecodeBatch(wave_samples), if one of the SubVectors in wave_samples has length 0, the next call to RunNnet3 will freeze.
I think you're right, sending a chunk of length 0 should be valid and allow you to end an utterance. We'll make the change. Thanks for reporting that. For future issues, it may be better to create a new GitHub issue to simplify tracking.
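Until that change lands, a caller-side guard like the following could avoid the freeze. This is only a sketch; the types and the DecodeBatch shape are assumptions carried over from the earlier sketches, and note that silently dropping an empty final chunk means the utterance is not flushed (see the silence-chunk workaround earlier in this thread).

```cpp
// Hedged sketch: skip zero-length chunks before calling DecodeBatch, since an
// empty SubVector currently freezes the next RunNnet3 call. PipelineT and the
// DecodeBatch signature are illustrative assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"

template <typename PipelineT>
void DecodeBatchSkippingEmptyChunks(
    PipelineT &pipeline, const std::vector<std::uint64_t> &corr_ids,
    const std::vector<kaldi::SubVector<kaldi::BaseFloat>> &chunks,
    const std::vector<bool> &is_last_chunk) {
  std::vector<std::uint64_t> kept_ids;
  std::vector<kaldi::SubVector<kaldi::BaseFloat>> kept_chunks;
  std::vector<bool> kept_last;
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    if (chunks[i].Dim() == 0) continue;  // drop empty chunks for now
    kept_ids.push_back(corr_ids[i]);
    kept_chunks.push_back(chunks[i]);
    kept_last.push_back(is_last_chunk[i]);
  }
  if (!kept_ids.empty()) pipeline.DecodeBatch(kept_ids, kept_chunks, kept_last);
}
```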
Hello, is it possible to modify the decoder into a server that could support multiple requests from different clients?
This is still WIP. It requires some cleaning, integrating the online MFCC into a separate PR (cf. below), and some other things.
This implements a low-latency, high-throughput pipeline designed for online use. It uses the GPU decoder, the GPU MFCC/ivector extraction, and a new lean nnet3 driver (including nnet3 context switching on device).
The online pipeline can be seen as taking a batch as input and then running a very regular algorithm: feature extraction, nnet3, decoder, and postprocessing on that same batch, in a synchronous fashion (i.e. all of those steps run when DecodeBatch is called; nothing is sent to asynchronous pipelines along the way). What happens when you run DecodeBatch is very regular, and because of that the pipeline is able to guarantee some latency constraints (the way the code will be executed is very predictable). It also focuses on being lean, avoiding reallocations or recomputations (such as recompiling nnet3).
The online pipeline takes care of computing [MFCC, iVectors], nnet3, the decoder, and postprocessing. It can either take chunks of raw audio as input (and then compute mfcc->nnet3->decoder->postprocessing), or it can be called directly with MFCC features/ivectors (and then compute nnet3->decoder->postprocessing). The second possibility is used by the offline wrapper when use_online_ivectors=false. A rough sketch of this call flow is shown below.
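A minimal sketch of the per-batch call flow described above. The stage functions are no-op placeholders for illustration, not the actual internal API.

```cpp
// Hedged sketch of the synchronous per-batch flow: every stage runs inside one
// DecodeBatch call, and nothing is handed off to an asynchronous pipeline,
// which is what makes the latency predictable. All names are placeholders.
struct Batch {
  bool has_raw_audio;  // true: raw audio chunks; false: precomputed features/ivectors
  // corr_ids, audio chunks or features, lattices, ...
};

static void ComputeFeatures(Batch &) {}   // GPU MFCC / iVectors
static void RunNnet3(Batch &) {}          // lean batched nnet3 driver
static void AdvanceDecoding(Batch &) {}   // GPU decoder
static void PostProcess(Batch &) {}       // lattice / best-path postprocessing

void DecodeBatchSketch(Batch &batch) {
  // Entry point 1: raw audio -> run feature extraction first.
  // Entry point 2: precomputed MFCC features/ivectors -> skip it
  // (the path used by the offline wrapper when use_online_ivectors=false).
  if (batch.has_raw_audio) ComputeFeatures(batch);
  RunNnet3(batch);
  AdvanceDecoding(batch);
  PostProcess(batch);
}
```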
The old offline pipeline is replaced by a new offline pipeline that is mostly a wrapper around the online pipeline. It provides an offline-friendly API (accepting full utterances as input instead of chunks) and adds the possibility of pre-computing ivectors on the full utterance first (use_online_ivectors=false). It then calls the online pipeline internally to do most of the work.
The easiest way to test the online pipeline end-to-end for now is to call it through the offline wrapper with use_online_ivectors=true. Please note that ivectors are ignored for now in this fully end-to-end online mode (i.e. when use_online_ivectors=true), because the GPU ivectors are not yet ready for online use; the pipeline code itself is ready, however. The offline pipeline with use_online_ivectors=false should be fully functional and returns the same WER as before.
It includes a new light nnet3 driver designed for the GPU. The key idea is that it's usually better to waste some flops computing things such as partial chunks or partial batches. For example, the last chunk of an utterance (say nframes=17) can be smaller than max_chunk_size (50 frames by default). In that case, compiling a new nnet3 computation for that exact chunk size is slower than just running it with a chunk size of 50 and ignoring the invalid output. A small sketch of this idea follows.
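A minimal sketch of the partial-chunk handling just described, with illustrative names only (this is not the actual driver code):

```cpp
// Hedged sketch: pad a partial last chunk up to the fixed chunk size so the
// precompiled nnet3 computation can be reused, and remember how many of the
// output frames are actually valid. Names are illustrative.
#include <algorithm>

struct ChunkPlan {
  int frames_to_compute;  // always the fixed chunk size, e.g. 50
  int valid_frames;       // how many of the computed frames are meaningful
};

ChunkPlan PlanChunk(int frames_in_chunk, int max_chunk_size /* e.g. 50 */) {
  ChunkPlan plan;
  plan.frames_to_compute = max_chunk_size;  // no recompilation for smaller chunks
  plan.valid_frames = std::min(frames_in_chunk, max_chunk_size);  // e.g. 17 for the last chunk
  return plan;
}
```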
The same idea applies to batch_size: the nnet3 computation always runs with a fixed minibatch size, defined as minibatch_size = std::min(max_batch_size, MAX_MINIBATCH_SIZE). MAX_MINIBATCH_SIZE is chosen to be large enough to hide the kernel launch latency and increase the arithmetic intensity of the GEMMs, but not larger, so that partial batches are not slowed down too much (i.e. avoiding running a minibatch of size 512 where only 72 utterances are valid). MAX_MINIBATCH_SIZE is currently 128. We then run nnet3 multiple times on the same batch if necessary: if batch_size=512, we run nnet3 (with minibatch_size=128) four times.
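The minibatching rule reduces to the following arithmetic (illustrative names, not the actual code):

```cpp
// Hedged sketch of the minibatching rule described above.
#include <algorithm>

constexpr int kMaxMinibatchSize = 128;  // MAX_MINIBATCH_SIZE in the description

int NumNnet3Passes(int batch_size, int max_batch_size) {
  int minibatch_size = std::min(max_batch_size, kMaxMinibatchSize);
  // Ceiling division: e.g. batch_size=512, minibatch_size=128 -> 4 nnet3 passes.
  return (batch_size + minibatch_size - 1) / minibatch_size;
}
```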
The context switch (restoring the nnet3 left and right context and the ivector) is done on device. Everything that needs context switching uses the concept of channels, to be consistent with the GPU decoder.
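For illustration, a sketch of the kind of per-channel state such a device-side context switch maintains; the struct and member names are assumptions, not the actual implementation.

```cpp
// Hedged sketch: per-channel buffers kept on device so that the nnet3 context
// and ivector can be restored when a new chunk of that channel arrives.
// Types and members are illustrative only.
#include <vector>
#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"

struct ChannelContext {
  kaldi::CuMatrix<kaldi::BaseFloat> nnet3_context_frames;  // left/right context frames, on device
  kaldi::CuVector<kaldi::BaseFloat> ivector;               // current iVector, on device
};

// One entry per channel; "channel" matches the GPU decoder's notion of channel.
using ChannelContexts = std::vector<ChannelContext>;
```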
Those "lean" approaches gave us better performance, and a drop in memory usage (total GPU memory usage from 15GB to 4GB for librispeech and batch size 500). It also removes the need for "high level" multithreading (i.e. cuda-control-threads).
Some parameters are dropped because the new code design doesn't require them (--cuda-control-threads, the drain size parameter). In theory the configuration should be greatly simplified: only --max-batch-size needs to be set; the others are optional.
The code in cudafeat/ modifies the GPU MFCC code. MFCC features can now be batched and processed online (restoring a few hundred frames of past audio for each new chunk). That code was implemented by @mcdavid109 (thanks!). We'll create a separate PR for it; it requires some cleaning, and a large part of the code is redundant with existing MFCC files.
GPU batched online ivectors and cmvn are WIP.
When used with use_online_ivectors=false, this code reaches 4,940 XRTF on librispeech/test_clean, with a latency of around 6x realtime for max_batch_size=512 (latency would be lower with a smaller max_batch_size).
One use case where only latency matters (and not throughput) is, for instance, the Jetson Nano, where some initial runs were measured at 5-10x realtime latency for a single channel (max_batch_size=1) on librispeech/clean. Those measurements are indicative only; more reliable measurements will be done in the future.