Re-enable cuda graphs in training modes. #9338

Merged
2 commits merged on May 29, 2024

Collaborator

@galv galv commented May 29, 2024

The "global" capture mode was crashing sporadically because other threads spawned by the data loader (when num_workers > 0) pin host memory, and in "global" mode a capture is invalidated by such calls from any thread, not just the capturing one.

This caused the ASR_dev_run_Speech_To_Text_HF_Finetuning CI/CD test to fail sporadically (maybe 1 run in 5).

What does this PR do?

Fixes the crash by using "thread_local" stream capture instead of "global" stream capture.
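For readers unfamiliar with the setting, here is a minimal sketch of what a thread-local capture looks like in plain PyTorch. It assumes a recent PyTorch build in which torch.cuda.graph accepts a capture_error_mode argument, and it is not NeMo's decoder code: the function name capture_doubling_graph and the toy "multiply by 2" workload are hypothetical, chosen only to show where the mode is passed.

```python
def capture_doubling_graph(n: int = 4):
    """Capture `out = inp * 2` into a CUDA graph using thread-local error mode.

    Hypothetical helper for illustration; requires PyTorch and a CUDA device.
    """
    # Imported here so this file still loads on machines without PyTorch.
    import torch

    static_in = torch.zeros(n, device="cuda")

    # Warm up on a side stream before capturing, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        static_in * 2
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    # "thread_local": only unsafe CUDA calls made by *this* thread invalidate
    # the capture. Calls from other threads (for example a data loader thread
    # that pins host memory) are not taken into account.
    with torch.cuda.graph(graph, capture_error_mode="thread_local"):
        static_out = static_in * 2

    return graph, static_in, static_out
```

To use it, fill the returned input tensor in place (for example static_in.fill_(3.0)), call graph.replay(), and read the result from static_out. With the default "global" mode the same capture could be invalidated by a host-memory allocation made in any other thread of the process.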

Collection: ASR

Usage

CUDA graphs are once again used by default for inference in both training and inference scripts, restoring the previous behavior.

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

PR Type:

  • [ ] New Feature
  • [x] Bugfix
  • [ ] Documentation

Note that I tested this by applying the following diff:

modified   examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml
@@ -134,6 +134,14 @@ model:
       warmup_steps: 5000
       warmup_ratio: null
       min_lr: 5e-6
+  decoding:
+    strategy: "greedy_batch"
+
+    # greedy strategy config
+    greedy:
+      max_symbols: 5
+      loop_labels: false
+      use_cuda_graph_decoder: true

 trainer:
   devices: -1 # number of GPUs, -1 would use all available GPUs
modified   examples/asr/speech_to_text_finetune.py
@@ -212,6 +212,12 @@ def main(cfg):
     if hasattr(cfg.model, 'spec_augment') and cfg.model.spec_augment is not None:
         asr_model.spec_augment = ASRModel.from_config_dict(cfg.model.spec_augment)

+    if hasattr(asr_model, 'change_decoding_strategy') and hasattr(asr_model, 'decoding'):
+        # This is not robust to all model types.
+        # import ipdb; ipdb.set_trace()
+        decoding_cfg = cfg.model.decoding
+        asr_model.change_decoding_strategy(decoding_cfg)
+
     trainer.fit(asr_model)

Then I ran this script:

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=True \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done

for i in $(seq 1 100); do
    python examples/asr/speech_to_text_finetune.py \
       --config-path="conf/asr_finetune" --config-name="speech_to_text_hf_finetune" \
       ~model.train_ds.hf_data_cfg \
       model.train_ds.num_workers=1 \
       model.train_ds.batch_size=2 model.validation_ds.batch_size=2 \
       model.train_ds.streaming=true \
       +model.train_ds.hf_data_cfg.path="librispeech_asr" \
       +model.train_ds.hf_data_cfg.name=null \
       +model.train_ds.hf_data_cfg.split="test.clean" \
       +model.train_ds.hf_data_cfg.streaming=true \
       ~model.validation_ds.hf_data_cfg \
       model.validation_ds.streaming=true \
       +model.validation_ds.hf_data_cfg.path="librispeech_asr" \
       +model.validation_ds.hf_data_cfg.name=null \
       +model.validation_ds.hf_data_cfg.split="test.clean" \
       +model.validation_ds.hf_data_cfg.streaming=true \
       ~model.test_ds \
       init_from_pretrained_model="stt_en_fastconformer_transducer_large" \
       model.tokenizer.update_tokenizer=False \
       model.optim.sched.warmup_steps=0 \
       +model.optim.sched.max_steps=3 \
       trainer.max_epochs=null \
       trainer.devices=1 \
       trainer.accelerator="gpu" \
       +trainer.fast_dev_run=True \
       model.decoding.greedy.loop_labels=False \
       exp_manager.exp_dir=examples/asr/speech_finetuning_results
done
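The shell loops above rely on watching the output for a crash. A small driver that counts non-zero exit codes makes the result easier to read off. This is only a sketch: count_failures is a hypothetical helper, not part of NeMo, and it treats any non-zero exit code as a failure.

```python
import subprocess
from typing import Sequence


def count_failures(cmd: Sequence[str], runs: int) -> int:
    """Run `cmd` `runs` times in sequence and return how many runs failed.

    A run counts as failed when the process exits with a non-zero code.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # discard output, only the exit code matters
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Passing the full python examples/asr/speech_to_text_finetune.py command line as a list of arguments with runs=100 would reproduce one of the loops above and return the number of crashed runs.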

Basically, I needed a way for speech_to_text_finetune.py to change the decoding algorithm so that I could test both the loop frames and loop labels code paths. I do not include this code in the PR, since it is not robust to all model types (e.g., AED). Since I ran each algorithm 100 times, we can be pretty sure that this fixes the problem.
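As a rough check on that confidence, the sketch below computes how likely 100 clean runs would be if the bug were still present. It assumes the "maybe 1 out of 5" failure rate quoted earlier, that runs fail independently, and that every run passed (which the comment implies but does not state). The function name prob_all_pass is made up for this example.

```python
def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass by chance."""
    return (1.0 - failure_rate) ** runs


# Assumed inputs: a 1-in-5 failure rate and 100 runs per decoding algorithm.
p = prob_all_pass(0.2, 100)
print(f"chance of 100 clean runs with the bug still present: {p:.2e}")
```

Under those assumptions the chance is on the order of 2e-10 per algorithm, so a clean sweep is strong evidence that the crash is gone.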

Collaborator

@titu1994 titu1994 left a comment

Is the only change the capture error mode?

Collaborator Author

galv commented May 29, 2024

@titu1994 yes, that is the only change, and I am very confident in it. I can elaborate if you want.

Collaborator Author

galv commented May 29, 2024

Well, I also had to revert Vladimir's previous commit, which turned cuda graphs off by default everywhere except in transcribe_speech.py and transcribe_speech_parallel.py. The commit in question: bb26e9846f

Collaborator

@artbataev artbataev left a comment

@galv, very cool, thank you!
Please also apply these changes to nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py

Collaborator Author

galv commented May 29, 2024

@artbataev good point. I completely missed TDT. Done. I'm not sure how to test that one, but I suspect that the change is low risk anyway.

@galv galv force-pushed the fix-cudnn-cuda-graph-error branch 2 times, most recently from deb8d28 to 9c52705 Compare May 29, 2024 17:49
"global" capture mode was sporadically crashing because of pinning
host memory in other threads spawned by the data loader when
num_workers > 0.

Add relevant changes to TDT cuda graphs decoding as well.

I didn't test the TDT change because I'm not sure how. But it seems low risk.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
@galv galv force-pushed the fix-cudnn-cuda-graph-error branch from 9c52705 to 45a2981 Compare May 29, 2024 18:36
@galv galv added Run CICD and removed Run CICD labels May 29, 2024
Signed-off-by: galv <galv@users.noreply.github.com>
@galv galv merged commit 4cefd5d into NVIDIA:r2.0.0rc0 May 29, 2024
109 checks passed
github-actions bot pushed a commit that referenced this pull request May 29, 2024
* Re-enable cuda graphs in training modes.

"global" capture mode was sporadically crashing because of pinning
host memory in other threads spawned by the data loader when
num_workers > 0.

Add relevant changes to TDT cuda graphs decoding as well.

I didn't test the TDT change because I'm not sure how. But it seems low risk.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: galv <galv@users.noreply.github.com>

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
galv added a commit that referenced this pull request Jun 5, 2024
* Re-enable cuda graphs in training modes.

"global" capture mode was sporadically crashing because of pinning
host memory in other threads spawned by the data loader when
num_workers > 0.

Add relevant changes to TDT cuda graphs decoding as well.

I didn't test the TDT change because I'm not sure how. But it seems low risk.

* Apply isort and black reformatting

---------

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
Co-authored-by: Daniel Galvez <galv@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
janekl pushed a commit that referenced this pull request Jun 12, 2024
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024