
[Longformer] Major Refactor #5219

Merged

Conversation

@patrickvonplaten (Contributor) commented Jun 23, 2020

Longformer Refactor

This PR does a major refactoring of Longformer. Mainly, the Roberta abstraction is removed and composition is used instead. This has the following advantages:

  • It's easier now to implement a cross_attention_layer
  • The code is more readable and the logic stays in this file only
  • A bug was corrected regarding the attention mask. @ibeltagy - maybe you can check this as well. Previously, if no attention_mask was passed, the padding function that came before super().forward() in LongformerModel was not applied; but if an attention_mask = torch.tensor([1, ..., 1]) (attend to all tokens) was passed instead, the padding function was applied, which could lead to different outputs than when no attention_mask is passed. This should not be the case: model(input_ids) and model(input_ids, attention_mask=torch.ones(input_ids.shape)) should always yield the same result (see the sketch after this list). Removing the super().forward() abstraction makes the code much cleaner here, so that an attention_mask = torch.ones(input_ids.shape) can be computed before calling the Longformer encoder. IMPORTANT: since in almost all tasks Longformer passes either a global_attention_mask or an attention_mask to LongformerModel, this bug did not really become visible before.
  • We don't have to "inject" a self-attention layer into another model anymore, which I did not like very much.
  • Unnecessary code can be removed (head_mask, previous cross-attention layer inputs that do not work yet), ...
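
A minimal sketch of the equivalence the fix guarantees (the checkpoint is the one used in the slow tests below; the example sentence and tolerance are arbitrary choices for illustration):

import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.eval()
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    # no attention_mask at all ...
    output_no_mask = model(input_ids)[0]
    # ... versus an explicit all-ones attention_mask (attend to every token)
    output_ones_mask = model(input_ids, attention_mask=torch.ones(input_ids.shape, dtype=torch.long))[0]

# after this PR both calls must produce identical hidden states
assert torch.allclose(output_no_mask, output_ones_mask, atol=1e-5)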

Additionally:

  • Variable names are made more explicit, dead code (if branches that could never be reached) was removed, and the code is simplified.
  • The forward function of the self-attention layer is broken up into multiple helper functions. The advantage is that a fair amount of memory should be saved because attention_probs goes out of scope as soon as it is no longer needed, which reduces the memory bottleneck (see the sketch after this list).
  • All Longformer models are added to the tests (@sgugger), and a couple more tests are added.
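
As an illustration of the memory point above, a toy sketch (not the actual Longformer code) of how moving the attention_probs computation into a helper lets the tensor go out of scope early:

import torch

def _compute_context(query, key, value):
    # attention_probs only lives inside this helper, so it can be freed
    # as soon as the helper returns
    attention_probs = torch.softmax(query @ key.transpose(-1, -2), dim=-1)
    return attention_probs @ value

def forward_sketch(query, key, value):
    context = _compute_context(query, key, value)
    # ... any further work here runs with attention_probs already released,
    # which lowers the peak memory compared to one monolithic forward()
    return context

# toy shapes just to exercise the sketch
q = k = v = torch.randn(1, 8, 128, 64)
out = forward_sketch(q, k, v)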

Next step is to add cross attention layers to longformer.

Review

I made sure that, apart from the fixed attention_mask = None vs. attention_mask = torch.ones(...) behavior, all outputs stay the same.
It would be great if @thomwolf @LysandreJik @sgugger @sshleifer @ibeltagy could do a quick review.

@patrickvonplaten changed the title from "[WIP - Don't merge!] Refactor longformer" to "[WIP - Don't merge] Refactor longformer" on Jun 23, 2020
codecov bot commented Jun 23, 2020

Codecov Report

Merging #5219 into master will decrease coverage by 0.81%.
The diff coverage is 92.60%.


@@            Coverage Diff             @@
##           master    #5219      +/-   ##
==========================================
- Coverage   77.85%   77.04%   -0.82%     
==========================================
  Files         138      138              
  Lines       24314    24409      +95     
==========================================
- Hits        18930    18806     -124     
- Misses       5384     5603     +219     
Impacted Files                               Coverage Δ
src/transformers/modeling_longformer.py      91.66% <92.60%> (-1.45%) ⬇️
src/transformers/modeling_tf_mobilebert.py   23.62% <0.00%> (-73.11%) ⬇️
src/transformers/modeling_tf_bert.py         73.37% <0.00%> (-25.00%) ⬇️
src/transformers/modeling_tf_utils.py        87.39% <0.00%> (-0.15%) ⬇️
src/transformers/modeling_openai.py          81.09% <0.00%> (+1.37%) ⬆️
src/transformers/modeling_tf_distilbert.py   98.76% <0.00%> (+32.51%) ⬆️
src/transformers/modeling_tf_openai.py       94.98% <0.00%> (+74.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@patrickvonplaten changed the title from "[WIP - Don't merge] Refactor longformer" to "Refactor longformer" on Jun 30, 2020
@@ -812,7 +812,7 @@ def test_multigpu_data_parallel_forward(self):
     # Wrap model in nn.DataParallel
     model = torch.nn.DataParallel(model)
     with torch.no_grad():
-        _ = model(**inputs_dict)
+        _ = model(**self._prepare_for_class(inputs_dict, model_class))
@patrickvonplaten (Contributor, Author) commented Jun 30, 2020:
@sgugger added the prepare function here because otherwise longformer tests were failing for multiple choice.
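
For context, a hedged sketch of why multiple-choice inputs need this extra preparation step; the shapes and the expand-based duplication are illustrative, not the exact logic of _prepare_for_class:

import torch

# multiple-choice heads expect (batch_size, num_choices, seq_len), while the
# shared test inputs are (batch_size, seq_len), so a prepare step has to add
# and expand the extra choice dimension before calling the model
batch_size, num_choices, seq_len = 4, 4, 7
input_ids = torch.randint(0, 100, (batch_size, seq_len))
mc_input_ids = input_ids.unsqueeze(1).expand(-1, num_choices, -1).contiguous()
assert mc_input_ids.shape == (batch_size, num_choices, seq_len)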

@@ -115,6 +115,18 @@ def prepare_config_and_inputs(self):
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])

def create_and_check_attention_mask_determinism(
@patrickvonplaten (Contributor, Author) commented Jun 30, 2020:
Previously there was a bug: running the model without an attention_mask and with attention_mask = torch.tensor([1, ..., 1]) did not give the same output.

self.model_tester.create_and_check_attention_mask_determinism(*config_and_inputs)

def test_longformer_model_global_attention_mask(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
@patrickvonplaten (Contributor, Author):
A test for the global attention mask was missing before.


@slow
def test_inference_no_head_long(self):
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
@patrickvonplaten (Contributor, Author):
Add one slow test that a machine with 16 GB of memory can run.

expected_loss = torch.tensor(0.0620, device=torch_device)
expected_prediction_scores_sum = torch.tensor(-6.1599e08, device=torch_device)
expected_prediction_scores_mean = torch.tensor(-3.0622, device=torch_device)
expected_loss = torch.tensor(0.0074, device=torch_device)
@patrickvonplaten (Contributor, Author):
Previously the model computed the wrong attention_mask when no attention_mask was given, so the expected values are updated here.

x = x.view(B, C, M, M + L) # B x C, M x L+M
x = x[:, :, :, :-1]
return x
total_num_heads, num_chunks, window_overlap, hidden_dim = chunked_hidden_states.size()
@patrickvonplaten (Contributor, Author):
better naming

@LysandreJik (Member):
Love the better naming

@sshleifer (Contributor):
why total_num_heads? Is there another num_heads?

@ibeltagy (Contributor):
@sshleifer, total_num_heads = num_heads * batch_size
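
To make the naming concrete, a small sketch with made-up shapes (the real values come from the batch and the model config):

import torch

batch_size, num_heads, num_chunks, window_overlap, hidden_dim = 2, 12, 8, 256, 64
# the leading dimension folds batch and heads together, hence "total_num_heads"
chunked_hidden_states = torch.randn(batch_size * num_heads, num_chunks, window_overlap, hidden_dim)

total_num_heads, num_chunks, window_overlap, hidden_dim = chunked_hidden_states.size()
assert total_num_heads == batch_size * num_heads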


if attention_mask is not None:
@patrickvonplaten (Contributor, Author):
simplify forward call and move a lot of code to helper functions

@patrickvonplaten requested a review from sgugger on June 30, 2020 08:50
@patrickvonplaten changed the title from "Refactor longformer" to "[Longformer] Major Refactor" on Jun 30, 2020
@LysandreJik (Member) left a comment:

Great changes overall, I love that the function/variable names became more explicit. The code now looks closer to the library's philosophy, which is a welcome change!


Comment on lines -280 to +331
- # TODO: make tests pass for those models
- # LongformerForSequenceClassification,
- # LongformerForQuestionAnswering,
- # LongformerForTokenClassification,
- # LongformerForMultipleChoice,
+ LongformerForSequenceClassification,
+ LongformerForQuestionAnswering,
+ LongformerForTokenClassification,
+ LongformerForMultipleChoice,
@LysandreJik (Member):

very cool diff

@sgugger (Collaborator) left a comment:

This looks great! And I love that the model gets properly tested now :-)

@sshleifer (Contributor) left a comment:

Halfway, sorry for all the nits. Feel free to ignore them! This is really cool!

(inline review threads on src/transformers/modeling_longformer.py, all resolved; the total_num_heads question is shown in the thread above)
@ibeltagy (Contributor) left a comment:

This looks great. It must have been a lot of work, thanks, @patrickvonplaten. I checked the attention_mask bug you mentioned and your fix is working well, thanks for addressing it. I also left a few comments, mostly nits, so feel free to address or ignore as you see fit.
Thanks.

(inline review threads on src/transformers/modeling_longformer.py, all resolved; the total_num_heads answer is shown in the thread above)
@patrickvonplaten (Contributor, Author) commented:

@sshleifer and @ibeltagy - thanks a lot for your comments. I cleaned up the comments and some of the function naming.

All slow and normal tests pass on GPU => good to merge.

@patrickvonplaten merged commit d697b6c into huggingface:master on Jul 1, 2020