Re-enable cuda graphs in training modes. #9343

Merged

Conversation

github-actions[bot] (Contributor)

"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.

This would cause the ASR_dev_run_Speech_To_Text_HF_Finetuning CI/CD test to fail sporadically (maybe 1 out of 5 times).

What does this PR do?

Fixes the crash by using "thread_local" stream capture instead of "global" stream capture.
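As an illustration only, here is a minimal PyTorch sketch of the distinction (assuming PyTorch >= 2.1 and a CUDA device; the workload is a placeholder and only capture_error_mode reflects this PR):

import torch

g = torch.cuda.CUDAGraph()
static_x = torch.randn(16, device="cuda")

# Warm up on a side stream before capturing, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y = static_x * 2.0
torch.cuda.current_stream().wait_stream(s)

# In "global" mode, CUDA flags potentially unsafe API calls from *any*
# thread while this capture is active, so a DataLoader's pin_memory
# thread registering host memory can abort the capture. "thread_local"
# only polices the capturing thread.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    static_y = static_x * 2.0

static_x.copy_(torch.randn(16, device="cuda"))
g.replay()  # re-runs the captured kernels with the updated input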

Collection: ASR

Usage

CUDA graphs will again be used by default for decoding, in both training and inference scripts, restoring the previous behavior.
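For example, a hedged sketch of setting these knobs from Python instead of YAML (the config keys come from the test diff later in this description; whether a partial decoding config is accepted may depend on the model type):

from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_transducer_large"
)

# Same keys as the YAML diff below; use_cuda_graph_decoder toggles the
# CUDA-graph-based greedy decoder this PR re-enables.
decoding_cfg = OmegaConf.create(
    {
        "strategy": "greedy_batch",
        "greedy": {
            "max_symbols": 5,
            "loop_labels": True,  # label-looping decoder; False for frame-looping
            "use_cuda_graph_decoder": True,
        },
    }
)
asr_model.change_decoding_strategy(decoding_cfg)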

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

PR Type:

  • [ ] New Feature
  • [x] Bugfix
  • [ ] Documentation

Note that I tested this by applying the following diff:

modified   examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml
@@ -134,6 +134,14 @@ model:
       warmup_steps: 5000
       warmup_ratio: null
       min_lr: 5e-6
+  decoding:
+    strategy: "greedy_batch"
+
+    # greedy strategy config
+    greedy:
+      max_symbols: 5
+      loop_labels: false
+      use_cuda_graph_decoder: true

 trainer:
   devices: -1 # number of GPUs, -1 would use all available GPUs
modified   examples/asr/speech_to_text_finetune.py
@@ -212,6 +212,12 @@ def main(cfg):
     if hasattr(cfg.model, 'spec_augment') and cfg.model.spec_augment is not None:
         asr_model.spec_augment = ASRModel.from_config_dict(cfg.model.spec_augment)

+    if hasattr(asr_model, 'change_decoding_strategy') and hasattr(asr_model, 'decoding'):
+        # This is not robust to all model types.
+        # import ipdb; ipdb.set_trace()
+        decoding_cfg = cfg.model.decoding
+        asr_model.change_decoding_strategy(decoding_cfg)
+
     trainer.fit(asr_model)

Then I ran these two loops, identical except for model.decoding.greedy.loop_labels:

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=True \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=False \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

Basically, I needed a way for speech_to_text_finetune.py to modify the decoding algorithm, so that I could test both the frame-looping (loop_labels=False) and label-looping (loop_labels=True) code paths. I do not include this code in the PR, since it is not robust to all model types (e.g., AED). Since I ran each algorithm 100 times, we can be reasonably confident that this fixes the problem.
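For what it's worth, a sketch of how that hook could be made more defensive (the isinstance check on EncDecRNNTModel is my assumption about which models accept these greedy keys, not something verified across the codebase; asr_model and cfg are the variables from main() in speech_to_text_finetune.py):

from nemo.collections.asr.models import EncDecRNNTModel

# Hypothetical guard, not part of this PR: only transducer models get the
# decoding override, so AED and CTC-only models are left untouched.
if isinstance(asr_model, EncDecRNNTModel) and hasattr(cfg.model, 'decoding'):
    asr_model.change_decoding_strategy(cfg.model.decoding)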

* Re-enable cuda graphs in training modes.

"global" capture mode was sporadically crashing because of pinning
host memory in other threads spawned by the data loader when
num_workers > 0.

Add relevant changes to TDT cuda graphs decoding as well.

I didn't test the TDT change because I'm not sure how. But it seems low risk.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: galv <galv@users.noreply.github.com>

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
@titu1994 titu1994 added Run CICD and removed Run CICD labels Jun 2, 2024
@galv galv merged commit 677203a into main Jun 5, 2024
133 checks passed
@galv galv deleted the cherry-pick-main-4cefd5d3636d6702a94b2c1d6a6c3a6edf123814 branch June 5, 2024 01:02
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
janekl pushed a commit that referenced this pull request Jun 12, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024