Re-enable cuda graphs in training modes. #9343

Merged

Conversation

github-actions[bot] (Contributor)

"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.

This would cause the ASR_dev_run_Speech_To_Text_HF_Finetuning CI/CD test to fail sporadically (maybe 1 out of 5 times).

What does this PR do?

Fixes the crash by using "thread_local" stream capture instead of "global" stream capture.
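As an illustration only, here is a minimal PyTorch sketch of the distinction (assuming PyTorch >= 2.1 and a CUDA device; the workload is a placeholder and only capture_error_mode reflects this PR):

import torch

g = torch.cuda.CUDAGraph()
static_x = torch.randn(16, device="cuda")

# Warm up on a side stream before capturing, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y = static_x * 2.0
torch.cuda.current_stream().wait_stream(s)

# In "global" mode, CUDA flags potentially unsafe API calls from *any*
# thread while this capture is active, so a DataLoader's pin_memory
# thread registering host memory can abort the capture. "thread_local"
# only polices the capturing thread.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    static_y = static_x * 2.0

static_x.copy_(torch.randn(16, device="cuda"))
g.replay()  # re-runs the captured kernels with the updated input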

Collection: ASR

Usage

CUDA graphs will again be used by default for decoding, in both training and inference scripts, restoring the previous behavior.
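For example, a hedged sketch of setting these knobs from Python instead of YAML (the config keys come from the test diff later in this description; whether a partial decoding config is accepted may depend on the model type):

from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_transducer_large"
)

# Same keys as the YAML diff below; use_cuda_graph_decoder toggles the
# CUDA-graph-based greedy decoder this PR re-enables.
decoding_cfg = OmegaConf.create(
    {
        "strategy": "greedy_batch",
        "greedy": {
            "max_symbols": 5,
            "loop_labels": True,  # label-looping decoder; False for frame-looping
            "use_cuda_graph_decoder": True,
        },
    }
)
asr_model.change_decoding_strategy(decoding_cfg)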

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

PR Type:

  • [ ] New Feature
  • [x] Bugfix
  • [ ] Documentation

Note that I tested this by applying the following diff:

modified   examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml
@@ -134,6 +134,14 @@ model:
       warmup_steps: 5000
       warmup_ratio: null
       min_lr: 5e-6
+  decoding:
+    strategy: "greedy_batch"
+
+    # greedy strategy config
+    greedy:
+      max_symbols: 5
+      loop_labels: false
+      use_cuda_graph_decoder: true

 trainer:
   devices: -1 # number of GPUs, -1 would use all available GPUs
modified   examples/asr/speech_to_text_finetune.py
@@ -212,6 +212,12 @@ def main(cfg):
     if hasattr(cfg.model, 'spec_augment') and cfg.model.spec_augment is not None:
         asr_model.spec_augment = ASRModel.from_config_dict(cfg.model.spec_augment)

+    if hasattr(asr_model, 'change_decoding_strategy') and hasattr(asr_model, 'decoding'):
+        # This is not robust to all model types.
+        # import ipdb; ipdb.set_trace()
+        decoding_cfg = cfg.model.decoding
+        asr_model.change_decoding_strategy(decoding_cfg)
+
     trainer.fit(asr_model)

Then I ran these two loops, identical except for model.decoding.greedy.loop_labels:

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=True \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=False \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

Basically, I needed a way for speech_to_text_finetune.py to modify the decoding algorithm, so that I could test both the frame-looping (loop_labels=False) and label-looping (loop_labels=True) code paths. I do not include this code in the PR, since it is not robust to all model types (e.g., AED). Since I ran each algorithm 100 times, we can be reasonably confident that this fixes the problem.
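For what it's worth, a sketch of how that hook could be made more defensive (the isinstance check on EncDecRNNTModel is my assumption about which models accept these greedy keys, not something verified across the codebase; asr_model and cfg are the variables from main() in speech_to_text_finetune.py):

from nemo.collections.asr.models import EncDecRNNTModel

# Hypothetical guard, not part of this PR: only transducer models get the
# decoding override, so AED and CTC-only models are left untouched.
if isinstance(asr_model, EncDecRNNTModel) and hasattr(cfg.model, 'decoding'):
    asr_model.change_decoding_strategy(cfg.model.decoding)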

* Re-enable cuda graphs in training modes.

"global" capture mode was sporadically crashing because of pinning
host memory in other threads spawned by the data loader when
num_workers > 0.

Add relevant changes to TDT cuda graphs decoding as well.

I didn't test the TDT change because I'm not sure how. But it seems low risk.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: galv <galv@users.noreply.github.com>

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
@titu1994 titu1994 added Run CICD and removed Run CICD labels Jun 2, 2024
@galv galv merged commit 677203a into main Jun 5, 2024
133 checks passed
@galv galv deleted the cherry-pick-main-4cefd5d3636d6702a94b2c1d6a6c3a6edf123814 branch June 5, 2024 01:02
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
janekl pushed a commit that referenced this pull request Jun 12, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024