
[src] CUDA Online/Offline pipelines + light batched nnet3 driver #3568

Merged
14 commits merged into kaldi-asr:master from cuda_online_offline_pipelines
May 1, 2020

Conversation

hugovbraun (Contributor)

This is still WIP. It requires some cleanup, moving the online MFCC code into a separate PR (cf. below), and a few other things.

This implements a low-latency, high-throughput pipeline designed for online use. It uses the GPU decoder, the GPU MFCC/iVector extraction, and a new lean nnet3 driver (including nnet3 context switching on device).

  • Online/Offline pipelines

The online pipeline takes a batch as input and runs a very regular algorithm on it: feature extraction, nnet3, decoder, and postprocessing, all on that same batch and in a synchronous fashion (i.e. all of those steps run when DecodeBatch is called; nothing is sent to asynchronous pipelines along the way). Because what happens when you run DecodeBatch is very regular and predictable, the pipeline can guarantee some latency constraints. It also focuses on being lean, avoiding reallocations and recomputations (such as recompiling nnet3).

The online pipeline takes care of computing [MFCC, iVectors], nnet3, the decoder, and postprocessing. It can either take chunks of raw audio as input (and then compute mfcc->nnet3->decoder->postprocessing), or be called directly with MFCC features/iVectors (and then compute nnet3->decoder->postprocessing). The second possibility is used by the offline wrapper when use_online_ivectors=false.
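
As a rough illustration of the synchronous call pattern described above, here is a minimal sketch of feeding one batch of raw-audio chunks to the online pipeline. It assumes the DecodeBatch(corr_ids, wave_samples, is_last_chunk) overload visible later in this thread; the header path and function name are assumptions, and correlation-ID setup and result retrieval are elided.

#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"
#include "cudadecoder/batched-threaded-nnet3-cuda-online-pipeline.h"  // header path assumed

// Hedged sketch: one DecodeBatch call runs feature extraction, nnet3,
// decoding and postprocessing synchronously on this batch.
void FeedOneBatch(
    kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline &pipeline,
    const std::vector<uint64_t> &corr_ids,               // one ID per active utterance
    const std::vector<kaldi::SubVector<float>> &chunks,  // one raw-audio chunk per utterance
    const std::vector<bool> &is_last_chunk) {            // end-of-utterance flags
  pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
}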

The old offline pipeline is replaced by a new offline pipeline which is mostly a wrapper around the online pipeline. It provides an offline-friendly API (accepting full utterances as input instead of chunks) and can pre-compute iVectors on the full utterance first (use_online_ivectors=false). It then calls the online pipeline internally to do most of the work.

The easiest way to test the online pipeline end-to-end for now is to call it through the offline wrapper with use_online_ivectors=true. Please note that iVectors are currently ignored in this fully end-to-end online mode (i.e. when use_online_ivectors=true), because the GPU iVectors are not yet ready for online; the pipeline code itself is ready, however. The offline pipeline with use_online_ivectors=false should be fully functional and returns the same WER as before.

  • Light nnet3 driver designed for GPU and online

It includes a new light nnet3 driver designed for the GPU. The key idea is that it's usually better to waste some flops on things such as partial chunks or partial batches. For example, the last chunk of an utterance (say nframes=17) can be smaller than max_chunk_size (50 frames by default). In that case, compiling a new nnet3 computation for that exact chunk size is slower than just running it for a chunk size of 50 and ignoring the invalid output.

The same idea applies to batch_size: the nnet3 computation always runs with a fixed minibatch size, defined as minibatch_size = std::min(max_batch_size, MAX_MINIBATCH_SIZE). MAX_MINIBATCH_SIZE is chosen large enough to hide the kernel launch latency and increase the arithmetic intensity of the GEMMs, but no larger, so that partial batches are not slowed down too much (i.e. avoiding running a minibatch of size 512 where only 72 utterances are valid). MAX_MINIBATCH_SIZE is currently 128. We then run nnet3 multiple times on the same batch if necessary: if batch_size=512, we run nnet3 (with minibatch_size=128) four times.
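
To make the arithmetic above concrete, here is an illustrative sketch of the fixed-minibatch strategy (the constant mirrors MAX_MINIBATCH_SIZE from the description; RunNnet3Minibatch is a hypothetical call, not the actual driver code):

#include <algorithm>

// Illustrative only: fixed-minibatch strategy as described above.
constexpr int kMaxMinibatchSize = 128;  // MAX_MINIBATCH_SIZE in the text

int NumNnet3Runs(int max_batch_size, int num_valid_utts) {
  // The nnet3 computation is compiled once for this fixed minibatch size.
  const int minibatch_size = std::min(max_batch_size, kMaxMinibatchSize);
  int num_runs = 0;
  // Cover the valid utterances with as many fixed-size runs as needed;
  // rows past num_valid_utts in the last (partial) minibatch are computed
  // anyway and their output ignored.
  for (int offset = 0; offset < num_valid_utts; offset += minibatch_size) {
    // RunNnet3Minibatch(offset, minibatch_size);  // hypothetical call
    ++num_runs;
  }
  return num_runs;  // e.g. max_batch_size=512, num_valid_utts=512 -> 4 runs
}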

The context switch (to restore the nnet left and right context, and the iVector) is done on device. Everything that needs context switching uses the concept of channels, to be consistent with the GPU decoder.
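
As a rough mental model of the channel idea (a sketch only; the struct name, fields, and layout are assumptions for illustration, not the actual data structures):

#include <vector>
#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"

// Illustrative only: per-channel state kept on device so that switching
// between utterances does not require a host round trip.
struct ChannelNnet3Context {
  kaldi::CuMatrix<float> context_frames;  // saved nnet3 left/right context frames
  kaldi::CuVector<float> ivector;         // current iVector for this channel
};
std::vector<ChannelNnet3Context> channel_contexts;  // indexed by channel, as in the GPU decoder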

Those "lean" approaches gave us better performance and a drop in memory usage (total GPU memory usage went from 15 GB to 4 GB for librispeech with batch size 500). They also remove the need for "high level" multithreading (i.e. cuda-control-threads).

  • Parameter simplification

Some parameters are dropped because the new code design doesn't require them (--cuda-control-threads, the drain size parameter). In theory the configuration is greatly simplified: only --max-batch-size needs to be set; the others are optional.

  • Adding batching and online support to the GPU MFCC

The code in cudafeat/ modifies the GPU MFCC code. MFCC features can now be batched and processed online (restoring a few hundred frames of past audio for each new chunk). That code was implemented by @mcdavid109 (thanks!). We'll create a separate PR for it; it requires some cleanup, and a large part of the code is redundant with existing MFCC files.
GPU batched online iVectors and CMVN are WIP.

  • Indicative measurements

When used with use_online_ivectors=false, this code reaches 4,940 XRTF on librispeech/test_clean, with a latency of around 6x realtime for max_batch_size=512 (latency would be lower with a smaller max_batch_size).
One use case where only latency matters (and not throughput) is for instance the Jetson Nano, where some initial runs were measured at 5-10x realtime latency for a single channel (max_batch_size=1) on librispeech/clean. Those measurements are indicative only; more reliable measurements will be done in the future.

@hugovbraun hugovbraun changed the title [src] CUDA Online/Offline pipelines + light batched nnet3 driver [src] [WIP] CUDA Online/Offline pipelines + light batched nnet3 driver Sep 4, 2019
@hugovbraun hugovbraun mentioned this pull request Sep 4, 2019
@hugovbraun hugovbraun force-pushed the cuda_online_offline_pipelines branch from 8cb7666 to 9417763 Compare September 5, 2019 19:05
@pingpiang2019

In cudadecoder/batched-static-nnet3.cc, we notice there is no right context padding at utt end. There used to be right context number of frames padded at utt end. Is this intended? We have some words dropped at the utt end.

int input_frames_per_chunk_;
int output_frames_per_chunk_;
BaseFloat seconds_per_chunk_;
BaseFloat samples_per_chunk_;


Shall we use int for variable samples_per_chunk?

@hugovbraun (Contributor Author)

I'll change that. Thanks

@hugovbraun (Contributor Author)

@pingpiang2019 The offline wrapper will take care of flushing the right context at the end. If you use it directly in online mode, then for now the best way is to send an extra chunk with silence in it - to flush the right context. It will be fixed at some point.
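
A minimal sketch of that workaround (assuming the DecodeBatch(corr_ids, wave_samples, is_last_chunk) overload that appears later in this thread; the function name, samples_per_chunk parameter, and header path are illustrative):

#include <cstdint>
#include <vector>
#include "matrix/kaldi-vector.h"
#include "cudadecoder/batched-threaded-nnet3-cuda-online-pipeline.h"  // header path assumed

// Flush the right context of one utterance by sending a final chunk of silence.
void FlushWithSilence(
    kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline &pipeline,
    uint64_t corr_id, int samples_per_chunk) {
  kaldi::Vector<float> silence(samples_per_chunk);  // zero-initialized by default
  std::vector<uint64_t> corr_ids = {corr_id};
  std::vector<kaldi::SubVector<float>> chunks = {
      kaldi::SubVector<float>(silence, 0, samples_per_chunk)};
  std::vector<bool> is_last_chunk = {true};
  pipeline.DecodeBatch(corr_ids, chunks, is_last_chunk);
}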

@ppamorim

ppamorim commented Nov 14, 2019

@hugovbraun Hi, any update on this? Thank you very much anyway!

@hugovbraun (Contributor Author)

@ppamorim yes, I've resumed work on this. Currently in the process of getting it ready for merge and more thorough testing.
@pingpiang2019 FYI the right context is now flushed automatically internally (as long as you set the end_of_sequence boolean). I have some more testing to do and then I'll push the commit.

@pingpiang2019

Does this mean we can't opt out of gpu_feature_extract anymore? Does the GPU feature extraction support pitch? At some point it didn't, according to:
batched-wav-nnet3-cuda core dump when set --gpu-feature-extract=true #3425
It seems our model with pitch crashes with a similar backtrace, and we can't fall back to the CPU.

@superliuwen

superliuwen commented Dec 4, 2019

@hugovbraun Hi, because there is no pitch CUDA code, in the current code we can set --gpu-feature-extract=false to compute features on the CPU. But in your new code, this setting is removed. How can we still use the CPU to compute features in order to support pitch? Thanks a lot!

@hugovbraun (Contributor Author)

Ok, that's an issue. Is pitch the only thing missing? We may need to add (back) the option for cpu fe for the offline pipeline.

@superliuwen

> Ok, that's an issue. Is pitch the only thing missing? We may need to add (back) the option for cpu fe for the offline pipeline.

@hugovbraun Thanks a lot for the quick reply. So far we have only found pitch missing, but feature extraction types such as PLP and fbank are also not included in the GPU version, so CPU extraction is important at the current stage. Another option is to use the CPU to extract pitch while still using the GPU to extract MFCC. Reading the code, it seems the offline and online pipelines are the same in your new design, which means that if CPU feature extraction is added to the offline pipeline, online can be supported too, right? Thanks!

@hugovbraun (Contributor Author)

Ok, thanks for the info.
We can add CPU FE back into the offline pipeline to allow use of special features. The offline and online pipelines are the same for everything except for feature extraction (because features such as ivectors can take advantage of being able to see the full utterance, and not process things in an online way).
For online support, we'll need to study things a bit more before making a decision.

@hugovbraun hugovbraun force-pushed the cuda_online_offline_pipelines branch from 9417763 to cea3ebb Compare January 8, 2020 01:50
OnlineNnet2FeaturePipeline feature(*feature_info_);
// TODO clean following lines
input_dim_ = feature.InputFeature()->Dim();
ivector_dim_ = feature.IvectorFeature()->Dim();


Does the current code require an MFCC+iVector model? I tried running batched-wav-nnet3-cuda with a model on hand that uses fbank without iVectors, and it segfaults at this line.

@al-zatv commented Jan 13, 2020

My fbank-based model works fine (using the batched-wav-nnet3-cuda executable).

Sorry, my executable was from the offline batch. My fbank model is not working either.

@hugovbraun (Contributor Author)

Yes, it's not tested yet without ivectors

@al-zatv

al-zatv commented Jan 13, 2020

This implementation is a little bit slower on my test model than the offline version from the main kaldi repository.
....

Sorry, my executable was from the offline batch. That comment was wrong.

@al-zatv

al-zatv commented Jan 19, 2020

In src/cudadecoderbin/batched-wav-nnet3-cuda.cc:229, it should probably be if (iterations > 1) or if (iter > 0) (and the former is better than the latter, in my opinion).

sorry, wrong branch.

@twisteroidambassador

twisteroidambassador commented Feb 13, 2020

batched-wav-nnet3-cuda from this PR crashes sporadically. I'm using the same data, model, and command line each time; sometimes it runs happily to completion, and sometimes it terminates before recognizing the first utterance:

LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:SelectGpuId():cu-device.cc:223) CUDA setup operating under Compute Exclusive Mode.
LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: Tesla V100-PCIE-16GB	free:15507M, used:653M, total:16160M, free/total:0.959574 version 7.0
LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (batched-wav-nnet3-cuda[5.5.605-568892b]:CheckAndFixConfigs():nnet3/nnet-am-decodable-simple.h:123) Increasing --frames-per-chunk from 50 to 51 to make it a multiple of --frame-subsampling-factor=3
ERROR (batched-wav-nnet3-cuda[5.5.605-568892b]:AddMatMat():cu-matrix.cc:1317) cublasStatus_t 14 : "CUBLAS_STATUS_INTERNAL_ERROR" returned from 'cublas_gemm(GetCublasHandle(), (transB==kTrans? CUBLAS_OP_T:CUBLAS_OP_N), (transA==kTrans? CUBLAS_OP_T:CUBLAS_OP_N), m, n, k, alpha, B.data_, B.Stride(), A.data_, A.Stride(), beta, data_, Stride())'

[ Stack-Trace: ]
/home/kaldi-user/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x8b7) [0x2b918602ecfd]
./batched-wav-nnet3-cuda(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x42719f]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuMatrixBase<float>::AddMatMat(float, kaldi::CuMatrixBase<float> const&, kaldi::MatrixTransposeType, kaldi::CuMatrixBase<float> const&, kaldi::MatrixTransposeType, float)+0x30d) [0x2b918307460d]
/home/kaldi-user/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::TdnnComponent::Propagate(kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float>*) const+0x200) [0x2b91824aaa4e]
/home/kaldi-user/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x1bc) [0x2b91823c77fe]
/home/kaldi-user/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x158) [0x2b91823c8c44]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedStaticNnet3::RunNnet3(kaldi::CuMatrix<float>*, int)+0x4f7) [0x2b91813db1c7]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedStaticNnet3::RunBatch(std::vector<int, std::allocator<int> > const&, std::vector<float*, std::allocator<float*> > const&, int, std::vector<float*, std::allocator<float*> > const&, std::vector<int, std::allocator<int> > const&, std::vector<bool, std::allocator<bool> > const&, kaldi::CuMatrix<float>*, std::vector<std::vector<std::pair<int, float*>, std::allocator<std::pair<int, float*> > >, std::allocator<std::vector<std::pair<int, float*>, std::allocator<std::pair<int, float*> > > > >*)+0x17e) [0x2b91813dc91a]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::RunNnet3(std::vector<int, std::allocator<int> > const&, std::vector<float*, std::allocator<float*> > const&, int, std::vector<int, std::allocator<int> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<float*, std::allocator<float*> > const&)+0x36) [0x2b91813c88b6]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::DecodeBatch(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<float*, std::allocator<float*> > const&, int, std::vector<int, std::allocator<int> > const&, std::vector<float*, std::allocator<float*> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<int, std::allocator<int> >*)+0xaa) [0x2b91813c9eea]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::DecodeBatch(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<kaldi::SubVector<float>, std::allocator<kaldi::SubVector<float> > > const&, std::vector<bool, std::allocator<bool> > const&)+0x686) [0x2b91813ca5dc]
/home/kaldi-user/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline::ComputeTasks()+0xba) [0x2b91813d8a22]
./batched-wav-nnet3-cuda(std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline::*)()> (kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline*)> >::_M_run()+0x29) [0x426653]
/lib64/libstdc++.so.6(+0xb5070) [0x2b91b2685070]
/lib64/libpthread.so.0(+0x7e65) [0x2b918d00ce65]
/lib64/libc.so.6(clone+0x6d) [0x2b91b2beb88d]

Any hints on how to pinpoint the problem? This is on CentOS 7, with Nvidia driver 440.33.01 and CUDA 10.2.89.

@hugovbraun hugovbraun force-pushed the cuda_online_offline_pipelines branch 2 times, most recently from adea2f6 to 19241b8 Compare February 14, 2020 01:20
@hugovbraun (Contributor Author)

@twisteroidambassador what kind of features are you using? Any chance you are using a model without ivectors?

@hugovbraun (Contributor Author)

@twisteroidambassador Just fixed a bug. However the case "model with mfcc but without ivectors" is still untested so there may be others.

@twisteroidambassador

@hugovbraun I was in fact using a f-bank model without i-vectors, so I had to transplant the original spectral feature code supporting fbank into the PR, and add a bunch of if (has_ivector_) checks.

I see that master now has new gpu online feature extraction natively supporting fbank. I'll try the latest version of this PR, see whether any more i-vector checks are necessary, and report back.
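
For illustration, here is a minimal sketch of the kind of guard being described, applied to the snippet quoted earlier in this thread (has_ivector_ is the flag name mentioned above; the else branch, and the assumption that IvectorFeature() returns NULL when no iVectors are configured, are mine):

// Hedged sketch of a null check around the initialization quoted earlier.
OnlineNnet2FeaturePipeline feature(*feature_info_);
input_dim_ = feature.InputFeature()->Dim();
if (feature.IvectorFeature() != NULL) {  // assumed NULL for models without iVectors
  has_ivector_ = true;
  ivector_dim_ = feature.IvectorFeature()->Dim();
} else {
  has_ivector_ = false;
  ivector_dim_ = 0;
}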

@twisteroidambassador

Attached is a patch that allows all three executables under cudadecoderbin to run with a fbank model without iVectors. It adds a bunch of iVector checks, and also fixes model_frequency in the online pipeline.

kaldi-gpu-online-ivector.diff.txt

However, the recognition results of batched-wav-nnet3-cuda-online are wrong, while the old executables produce the correct results.

@hugovbraun (Contributor Author)

@twisteroidambassador by old executables do you mean cudadecoderbin/batched-wav-nnet3-cuda and cudadecoderbin/batched-wav-nnet3-cuda2, or just the version without the "2"?

@twisteroidambassador

@hugovbraun both batched-wav-nnet3-cuda and batched-wav-nnet3-cuda2 produce correct results, while batched-wav-nnet3-cuda-online does not.

@hugovbraun (Contributor Author)

@twisteroidambassador Ok. batched-wav-nnet3-cuda2 calls the online pipeline behind the scenes, with the exception of the feature extraction if --use-online-features is set to false (the default). Could you try running batched-wav-nnet3-cuda2 with --use-online-features=true to try to isolate the bug?

@hugovbraun (Contributor Author)

@twisteroidambassador I can repro the problem locally with a fbank model and --use-online-features=true. Looks like we have a bug in our online fbank code. We'll take a look.

@twisteroidambassador

@hugovbraun Yes, I can confirm that adding --use-online-features=true to the invocation of batched-wav-nnet3-cuda2 produces incorrect recognition results, consistent with the results from batched-wav-nnet3-cuda-online.

@hugovbraun hugovbraun force-pushed the cuda_online_offline_pipelines branch from e472e3a to beb1c4b Compare March 3, 2020 18:59
@twisteroidambassador

I was just thinking, what will happen if a correlation ID is repeated in a batch? i.e. if a batch is [(corrID 1, chunk 0-50), (corrID 1, chunk 50-100), (corrID 1, chunk 100-150)], does it still work correctly, error out, or do something unexpected?

@twisteroidambassador

Also, is there an expected accuracy drop with the new online pipeline? I compared the recognition results between cudadecoderbin/batched-wav-nnet3-cuda-online and cudafeatbin/compute-online-feats-batched-cuda -> nnet3bin/nnet3-latgen-faster-batch, and there are some substantial differences. Presumably the only difference between these two pipelines is whether the decoder is run on GPU or CPU.

@hugovbraun (Contributor Author)

@twisteroidambassador You need to use at most one chunk per corr_id per batch. Looks like I forgot to put a comment about this. We'll add an assert at some point.
Regarding WER, we usually see a difference on the order of a few hundredths of a percentage point. For instance, for the librispeech model we use, we see a WER around 5.54% instead of 5.50%. That's something we plan to look at at some point. If the difference you're seeing is larger, you can try increasing --max-active on the GPU version.
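
To make that constraint concrete, a rough sketch (reusing the raw-audio DecodeBatch signature from earlier in the thread; the function name and the chunk0/chunk1/chunk2 SubVector views over one utterance's audio are assumptions):

#include <cstdint>
#include "matrix/kaldi-vector.h"
#include "cudadecoder/batched-threaded-nnet3-cuda-online-pipeline.h"  // header path assumed

// At most one chunk per corr_id per DecodeBatch call: successive chunks of the
// same utterance go into successive calls (other utterances may share each batch).
void FeedOneUtterance(
    kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline &pipeline,
    uint64_t corr_id, const kaldi::SubVector<float> &chunk0,
    const kaldi::SubVector<float> &chunk1, const kaldi::SubVector<float> &chunk2) {
  // Not: DecodeBatch({corr_id, corr_id, corr_id}, {chunk0, chunk1, chunk2}, ...);
  pipeline.DecodeBatch({corr_id}, {chunk0}, {false});
  pipeline.DecodeBatch({corr_id}, {chunk1}, {false});
  pipeline.DecodeBatch({corr_id}, {chunk2}, {true});  // true = last chunk of the utterance
}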

@twisteroidambassador

@hugovbraun I have been using the exact same config options for GPU and CPU decoders, including setting a low-ish --max-active. Does the GPU decoder need a higher --max-active value to achieve the same accuracy as the CPU decoder?

@danpovey (Contributor)

danpovey commented Mar 11, 2020 via email

@al-zatv

al-zatv commented Mar 12, 2020

@hugovbraun I have a question about receiving partial results. My use case is pretty common: listen to audio channels, send partial results back often, and show the final result afterwards, like a multi-channel online2-tcp-nnet3-decode-faster.
But I don't see an interface for getting partial results from BatchedThreadedNnet3CudaOnlinePipeline. I see CudaDecoder::GetBestPath(), so I think my task is solvable. I wrote BatchedThreadedNnet3CudaOnlinePipeline::GetBestPath as follows, but it is slow and fails often. What is the right way to do it?

void BatchedThreadedNnet3CudaOnlinePipeline::GetBestPath(
    const std::vector<CorrelationID> &corr_ids,
    std::vector<Lattice *> &fst_out_vec, bool use_final_probs) 
{
    std::vector<ChannelId> channels;
    channels.reserve(corr_ids.size());
  
    for (int i = 0; i < corr_ids.size(); ++i) {
        int corr_id = corr_ids[i];
        auto it = corr_id2channel_.find(corr_id);
        KALDI_ASSERT(it != corr_id2channel_.end());
        int ichannel = it->second;
        channels.push_back(ichannel);
    }    
    cuda_decoder_->GetBestPath(channels,fst_out_vec,false);
}

@twisteroidambassador

With some more testing, I determined that with my current model and config (specifically beam, lattice-beam and max-active options), the limiting factor is beam. After loosening the constraints gradually until the one-best result no longer changes, for some utterances the result from GPU decoder is still not the same as that from the CPU decoder.

So, I guess my question is: Should I expect the GPU and CPU decoder to produce identical output for the same utterance, either when constrained to the same pruning options such as beam, or when unconstrained?

@danpovey (Contributor)

danpovey commented Mar 13, 2020 via email

@twisteroidambassador

Is there any documentation, discussion, mail archive, etc. where I can learn about the algorithmic and design differences between the GPU and CPU decoder? I'll also try reading the code, but a higher level description would be very welcome.

@hugovbraun (Contributor Author)

@twisteroidambassador @danpovey You are right, we need to write an up to date "How to use" guide for the decoder itself.
In short:

  • Regarding parameters, the main difference concerns --max-active. For the CPU decoder, that parameter determines the max number of FST states kept at each iteration. For the GPU decoder, it determines the max number of FST arcs kept at each iteration. It is usually necessary to set it higher for the GPU.
  • All other decoding parameters should behave the same way. @twisteroidambassador Your comment about beam is interesting: increasing --beam (vs. CPU) improved WER but not --max-active?
  • For (very) challenging cases (bad audio), or the combination of a challenging case and a high beam, you may need to raise the queue capacities with --main-q-capacity and --aux-q-capacity (--help will tell you what to set; try 2x the default). Otherwise the mechanism responsible for overflow prevention will trigger and reduce the beam. It is usually not necessary to manually change the capacity of those queues.

The final results are not exactly the same as the CPU ones (due to the reasons listed above), but are described as "virtually identical" by our partners.

@al-zatv implementing an efficient way to get partial results back after each output frame is next on our list (along with endpointing, which is a similar change). As of today there's no real way to get partial results back (you could rely on the "normal" GetRawLattice, but it will be slow).

@twisteroidambassador

twisteroidambassador commented Mar 17, 2020

@hugovbraun I reran the tests to be sure. Yes, starting from my "stock" settings (--max-active=7000 --beam=6.0 --lattice-beam=8.0), raising --max-active to 14000 does not change the recognition results (cpu vs cpu, gpu vs gpu), the one-best results are identical within random variations. (By random variation, I mean when running batched-wav-nnet3-cuda-online repeatedly with same everything, the one-best results are not completely identical, out of the ~400 utterances usually 2~3 are different.) Raising --beam to 9.0 does result in significantly different one-best results. So I believe the --beam setting is the limiting factor in my stock settings.

Given that, it's strange that at stock settings the one-best cpu vs gpu output is quite different, differing on ~50 utterances. And looking at the generated lattice files, with --determinize-lattice=false the gpu ark is ~5MB while the cpu ark is ~50MB, and with --determinize-lattice=true the gpu ark is ~700KB while the cpu ark is ~1.4MB. (These are binary ark files.)

As for --main-q-capacity and --aux-q-capacity, I have not seen any related warning messages printed to the console, so I don't think those were hit.

@twisteroidambassador

How does the pipeline handle acoustic / language model scaling? When the --acoustic-scale argument is given on the command line, are the scores scaled accordingly during nnet3 computations and decoding, and if so, is the lattice scaled back before output?

@twisteroidambassador

And a related question. Since BatchedThreadedNnet3CudaOnlinePipelineConfig uses NnetSimpleComputationOptions, it inherits its description on the acoustic-scale option, which says "caution: is a no-op if set in the program nnet3-compute". Does this caution also apply to the other nnet3-* programs?

(I was looking through nnet3-latgen-faster-batch.cc for scaling operations, to see whether that explains why the results are different from the gpu pipeline. I found several places where it seems to scale inversely w.r.t. acoustic_scale before output, but couldn't find where it scales proportionally to acoustic_scale. It was rather confusing.)

@hugovbraun (Contributor Author)

@danpovey (Contributor)

danpovey commented May 1, 2020

Did not realize this was ready. Merging.

@danpovey danpovey merged commit 0bca93e into kaldi-asr:master May 1, 2020
@twisteroidambassador

Just found out that, when calling DecodeBatch(wave_samples), if one of the SubVectors in wave_samples has length 0, the next call to RunNnet3 will freeze.
It can be worked around easily by passing a SubVector of length 1 instead, but that feels rather awkward.
Since the only way to end an utterance is to pass in an additional chunk, maybe it should accept 0-length chunks? Or maybe there can be another method to end an utterance without requiring a chunk?

@hugovbraun (Contributor Author)

I think you're right, sending a chunk of length 0 should be valid and allow you to end an utterance. We'll make the change. Thanks for reporting that.

For future issues, it may be better to create a new github issue to simplify tracking.

@kli017

kli017 commented Jun 22, 2020

Hello, is it possible to modify the decoder into a server that could support multiple requests from different clients?
