Re-enable cuda graphs in training modes. #9338

galv · 2024-05-29T15:47:29Z

CUDA graph stream capture in "global" mode was sporadically crashing because threads spawned by the data loader (when num_workers > 0) pin host memory while a capture is in progress.

This would cause the ASR_dev_run_Speech_To_Text_HF_Finetuning CI/CD test to fail sporadically (maybe 1 out of 5 times).

What does this PR do?

Fixes the crash by using "thread_local" stream capture instead of "global" stream capture.
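
As a rough illustration of the capture mode change (this is not the NeMo decoder code), the sketch below captures a trivial kernel with PyTorch's `torch.cuda.graph` context manager. It assumes a PyTorch version whose `torch.cuda.graph` accepts the `capture_error_mode` argument. The helper name `capture_add_one` and the tensors are made up for the example, and the helper returns `None` when PyTorch or a GPU is not available.

```python
import importlib.util


def capture_add_one(mode="thread_local"):
    """Capture `y = x + 1` into a CUDA graph and replay it once.

    Illustrative helper only (hypothetical name, not NeMo code).
    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    x = torch.zeros(4, device="cuda")
    _ = x + 1  # warm-up launch outside of capture
    torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    # With "global" (the default), a potentially unsafe CUDA call from any
    # thread, such as a pinned-memory allocation, is an error while the
    # capture is open. With "thread_local", only the capturing thread is
    # checked, so other threads can keep pinning memory.
    with torch.cuda.graph(graph, capture_error_mode=mode):
        y = x + 1  # recorded into the graph, not executed yet

    x.fill_(2.0)    # update the static input in place
    graph.replay()  # run the captured kernel on the new contents
    torch.cuda.synchronize()
    return y.cpu().tolist()


result = capture_add_one()
print(result)
```

With `mode="global"`, the same capture can fail if another thread pins host memory while the capture is open, which matches the sporadic crash described above.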

Collection: ASR

Usage

CUDA graphs are once again used by default for decoding (inference) in both training and inference scripts, restoring the behavior from before they were turned off.

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

PR Type:

[ ] New Feature
[x] Bugfix
[ ] Documentation

Note that I tested this by applying the following diff:

modified   examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml
@@ -134,6 +134,14 @@ model:
       warmup_steps: 5000
       warmup_ratio: null
       min_lr: 5e-6
+  decoding:
+    strategy: "greedy_batch"
+
+    # greedy strategy config
+    greedy:
+      max_symbols: 5
+      loop_labels: false
+      use_cuda_graph_decoder: true

 trainer:
   devices: -1 # number of GPUs, -1 would use all available GPUs
modified   examples/asr/speech_to_text_finetune.py
@@ -212,6 +212,12 @@ def main(cfg):
     if hasattr(cfg.model, 'spec_augment') and cfg.model.spec_augment is not None:
         asr_model.spec_augment = ASRModel.from_config_dict(cfg.model.spec_augment)

+    if hasattr(asr_model, 'change_decoding_strategy') and hasattr(asr_model, 'decoding'):
+        # This is not robust to all model types.
+        # import ipdb; ipdb.set_trace()
+        decoding_cfg = cfg.model.decoding
+        asr_model.change_decoding_strategy(decoding_cfg)
+
     trainer.fit(asr_model)

Then I ran these two loops, the first with loop_labels=True and the second with loop_labels=False:

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=True \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=False \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done
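
The shell loops above neither stop on a failure nor tally failures, so a crash has to be spotted in the output. A minimal sketch of a tallying wrapper follows. The helper name `count_failures` is hypothetical, and the stand-in commands exist only so the sketch runs anywhere; the real command list would be the `python examples/asr/speech_to_text_finetune.py ...` invocation with its overrides.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if completed.returncode != 0:
            failures += 1
    return failures


# Stand-in commands (assumptions for the demo, not part of the PR).
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
always_bad = [sys.executable, "-c", "raise SystemExit(1)"]

print(count_failures(always_ok, 3), count_failures(always_bad, 3))  # prints "0 3"
```

For a real run it would make sense to keep stdout and stderr (for example by redirecting them to per-run log files) instead of discarding them, so that a failing run can be inspected afterwards.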

Basically, I needed a way for speech_to_text_finetune.py to change the decoding algorithm, so that I could test both the loop frames and loop labels code paths. I did not include this code in the PR, since it is not robust to all model types (e.g., AED). Since I ran each algorithm 100 times, we can be pretty sure that this fixes the problem.
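
To put a number on "pretty sure": the following back-of-the-envelope sketch assumes that runs are independent, that the local runs would fail at roughly the same rate as the CI test ("maybe 1 out of 5 times") if the bug were still present, and that all 100 runs per algorithm passed, which is what the conclusion above implies. The function name `prob_all_pass` is made up for the example.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass at a given failure rate."""
    return (1.0 - failure_rate) ** runs


# Chance of 100 clean runs in a row if the bug were still there (rate ~ 1/5).
p = prob_all_pass(0.2, 100)
print(f"{p:.1e}")  # on the order of 2e-10
```

Under those assumptions, 100 consecutive passes would be extremely unlikely with the bug still present. The estimate says nothing about the TDT code path, which was not exercised by these runs.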

titu1994

Is the only change the capture error mode?

galv · 2024-05-29T16:07:02Z

@titu1994 yes, that is the only change, and I am very confident in it. I can elaborate if you want.

galv · 2024-05-29T16:08:04Z

Well, I also had to undo Vladimir's previous commit that turns cuda graphs off by default, except in transcribe_speech.py and transcribe_speech_parallel.py. This commit's changes: bb26e9846f

artbataev

@galv, very cool, thank you!
Please also add these changes to nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py

galv · 2024-05-29T17:39:23Z

@artbataev good point. I completely missed TDT. Done. I'm not sure how to test that one, but I suspect that the change is low risk anyway.

"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk.
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

Signed-off-by: galv <galv@users.noreply.github.com>

* Re-enable cuda graphs in training modes.
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk.
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
* Apply isort and black reformatting
Signed-off-by: galv <galv@users.noreply.github.com>
---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>

* Re-enable cuda graphs in training modes.
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk.
* Apply isort and black reformatting
---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
Co-authored-by: Daniel Galvez <galv@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* Re-enable cuda graphs in training modes.
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk.
* Apply isort and black reformatting
---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
Co-authored-by: Daniel Galvez <galv@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>

* Re-enable cuda graphs in training modes.
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk.
* Apply isort and black reformatting
---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
Co-authored-by: Daniel Galvez <galv@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

github-actions bot added the ASR label May 29, 2024

galv requested review from artbataev, titu1994 and pablo-garay May 29, 2024 15:48

galv force-pushed the fix-cudnn-cuda-graph-error branch from 9a79543 to cec4f53 Compare May 29, 2024 15:49

galv added the Run CICD label May 29, 2024

titu1994 reviewed May 29, 2024

artbataev reviewed May 29, 2024

galv force-pushed the fix-cudnn-cuda-graph-error branch 2 times, most recently from deb8d28 to 9c52705 Compare May 29, 2024 17:49

galv force-pushed the fix-cudnn-cuda-graph-error branch from 9c52705 to 45a2981 Compare May 29, 2024 18:36

galv added Run CICD and removed Run CICD labels May 29, 2024

Apply isort and black reformatting

ee6253f

Signed-off-by: galv <galv@users.noreply.github.com>

titu1994 approved these changes May 29, 2024

titu1994 added Run CICD and removed Run CICD labels May 29, 2024

galv merged commit 4cefd5d into NVIDIA:r2.0.0rc0 May 29, 2024
109 checks passed
