Remove mask if use fusion mask (#9723)
* Remove mask if use fusion mask

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hsiehjackson <hsiehjackson@users.noreply.github.com>

---------

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: hsiehjackson <hsiehjackson@users.noreply.github.com>
Co-authored-by: hsiehjackson <hsiehjackson@users.noreply.github.com>
hsiehjackson committed Jul 15, 2024
1 parent 02ff85b commit 27b5c47
Showing 1 changed file with 3 additions and 0 deletions: nemo/collections/nlp/modules/common/text_generation_utils.py
@@ -794,6 +794,9 @@ def generate(
     if random_seed is not None:
         seed_everything(random_seed)

+    if hasattr(model, 'get_attention_mask_from_fusion') and model.get_attention_mask_from_fusion:
+        compute_attention_mask = False
+
     output = synced_generate(
         model,
         inference_strategy,

5 comments on commit 27b5c47

@Slyne (Contributor) commented on 27b5c47, Jul 22, 2024

@yaoyu-33, @xuanzic found that our neva_evaluation.py no longer works on the main branch. After checking, we found that this commit breaks NeVA evaluation: in NeVA, compute_attention_mask is always assumed to be True, so the attention_mask is always computed. This change makes attention_mask None. To be specific: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py#L1069
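A minimal sketch of the failure mode, with illustrative names rather than the actual NeMo/NeVA code: the generation path only builds the 2D mask when compute_attention_mask is True, so forcing it to False hands None to any downstream caller that assumes a tensor.

```python
# Illustrative sketch only -- names and shapes are assumptions, not the NeMo source.
import torch

def build_inputs(tokens, compute_attention_mask=True):
    attention_mask = None
    if compute_attention_mask:
        seq_len = tokens.size(1)
        # lower-triangular causal mask: the large 2D tensor discussed in this thread
        attention_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    return tokens, attention_mask

tokens = torch.zeros(1, 8, dtype=torch.long)
_, mask = build_inputs(tokens, compute_attention_mask=False)
# A caller that assumes the mask is always a tensor (as the NeVA path reportedly does)
# now fails with: AttributeError: 'NoneType' object has no attribute 'cuda'
```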

@Slyne (Contributor) commented on 27b5c47, Jul 22, 2024

@hsiehjackson For visibility.

@hsiehjackson (Collaborator, Author) commented on 27b5c47

@Slyne I think you need to set get_attention_mask_from_fusion: False in your config https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L155
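For reference, a hedged sketch of applying that override with OmegaConf; the exact key path under the model section is an assumption based on the linked megatron_gpt_config.yaml.

```python
# Assumed config layout -- adjust the path and key to match your own config file.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_config.yaml")
cfg.model.get_attention_mask_from_fusion = False  # keep building the explicit 2D attention mask
```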

@Slyne (Contributor) commented on 27b5c47, Jul 23, 2024

> @Slyne I think you need to set get_attention_mask_from_fusion: False in your config https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L155

@hsiehjackson It seems a bit inappropriate that a user can set compute_attention_mask=True on purpose while the call actually uses False (because of get_attention_mask_from_fusion), without any warning or error. Also, generate() is a public function and is called from other code paths.
Could you check whether this statement can be moved outside the generate() function, for example into megatron_gpt_generate()? Or simply move it into the model's forward function, setting attention_mask to None when get_attention_mask_from_fusion is True.
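A hedged sketch of that alternative, using illustrative names rather than the NeMo classes: drop (and warn about) the mask inside the model's forward pass instead of silently overriding the caller's compute_attention_mask flag.

```python
# Illustrative wrapper -- not the actual NeMo model class.
import warnings
import torch

class MaskAwareModel(torch.nn.Module):
    def __init__(self, inner, get_attention_mask_from_fusion=True):
        super().__init__()
        self.inner = inner
        self.get_attention_mask_from_fusion = get_attention_mask_from_fusion

    def forward(self, tokens, attention_mask=None):
        if self.get_attention_mask_from_fusion and attention_mask is not None:
            warnings.warn("get_attention_mask_from_fusion=True: ignoring the provided attention_mask.")
            attention_mask = None  # fused/flash attention applies causal masking internally
        return self.inner(tokens, attention_mask=attention_mask)
```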

@hsiehjackson (Collaborator, Author) commented on 27b5c47

@Slyne The reason we want to set this to False is that we hit OOM when creating a giant 2D attention mask for long contexts. Usually we use flash attention (with built-in causal masking), so we don't need to create this tensor and occupy GPU memory. Moving the logic to the forward function is a good suggestion; however, the attention mask is created under the text generation strategy here. Can you check whether the attention mask is None before calling .cuda(), as MegatronGPTModel does? It would also be great if you could avoid creating the attention mask at all when you are using a causal mask with flash attention.
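A hedged sketch of the suggested guard, with illustrative names rather than the MegatronGPTModel source: skip the mask entirely when flash attention's built-in causal masking is used, and only call .cuda() when a mask was actually built.

```python
# Illustrative helper -- names are assumptions, not the NeMo API.
def prepare_attention_mask(attention_mask, use_flash_attention=True):
    if use_flash_attention:
        return None  # causal masking happens inside the fused kernel; no 2D tensor needed
    if attention_mask is not None:
        attention_mask = attention_mask.cuda(non_blocking=True)
    return attention_mask
```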
