Reduce inference time by 30% when using Flair embeddings #1074
Conversation
@pommedeterresautee sorry for the delay - I'm planning to look into #1078 and review this PR today!
Hello @pommedeterresautee thanks for the PR! Prediction speed is much improved when setting storage mode to 'cpu'. However, when setting storage mode to 'gpu', I now get OOM errors. For instance:

```python
from flair.datasets import WNUT_17
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus = WNUT_17(in_memory=True)

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    WordEmbeddings('glove'),
])

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        )

# 6. initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train(
    'resources/taggers/local-test',
    max_epochs=5,
    train_with_dev=True,
    embeddings_storage_mode='gpu',
)
```

This previously trained well with GPU storage mode, but now throws memory errors. It may be that without the `clone()` the full hidden-state tensors are never released from GPU memory.
@pommedeterresautee quick update: I think the problem lies in this line: https://github.com/zalandoresearch/flair/blob/e0437d577be28cf7d099df588ad3a997e542df60/flair/embeddings.py#L1856

where we select, from all hidden states of all characters in the sentence, only the states at one position to become the word embedding. Essentially, what we need to do here is copy these states over to a new tensor so that we can discard all the rest of the huge tensor.

With your PR, we don't do the `clone()`, so the selected embedding stays a view into the huge tensor. With storage mode 'none' this has no effect, since the embeddings are discarded anyway. But if the storage mode is 'gpu', the views keep the full hidden-state tensors alive in GPU memory, which explains the OOM errors.
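A minimal sketch of the view-vs-copy behavior described above, with illustrative names and shapes (not flair's actual code):

```python
import torch

# Stand-in for the LM output: hidden states for every character position,
# shape (num_chars, batch_size, hidden_size) - a large tensor.
all_hidden_states = torch.randn(2000, 32, 2048)

# Plain indexing returns a *view* that shares storage with the big tensor,
# so keeping this "embedding" alive keeps the whole buffer alive too.
embedding_view = all_hidden_states[42, 0, :]
print(embedding_view.storage().data_ptr() == all_hidden_states.storage().data_ptr())  # True

# clone() copies the selected states into fresh storage, so the big
# tensor can be freed once it goes out of scope.
embedding_copy = all_hidden_states[42, 0, :].clone()
print(embedding_copy.storage().data_ptr() == all_hidden_states.storage().data_ptr())  # False
```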
Came to the same conclusion as you. I agree that this has no effect on 'none', but it increases memory consumption on CPU. Things that don't work: `narrow` (it keeps the same under-the-hood data layout) and `split` (it splits into a list of tensors that still share storage). Complicated...
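For what it's worth, a quick illustrative check of why `narrow` and `split` don't help: both return views that still share the original storage:

```python
import torch

big = torch.randn(100, 16)

# narrow() returns a view into the same underlying storage.
narrowed = big.narrow(0, 10, 1)
print(narrowed.storage().data_ptr() == big.storage().data_ptr())  # True

# split() returns a tuple of views, each still backed by the same storage.
chunks = big.split(10, dim=0)
print(chunks[0].storage().data_ptr() == big.storage().data_ptr())  # True
```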
One potential solution would be to define a global variable. We currently only have two: `flair.device` and `flair.cache_root`. We could add an `embedding_storage_mode` global that gets set during training/prediction, so that the embedding classes know whether a clone is needed. Drawback is that this adds another global variable, but since it is only for non-essential optimization instructions, maybe it would be ok in this case. What do you think? I could add this in a separate PR.
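A rough sketch of how such a flag could be used (the `embedding_storage_mode` global and the helper function are hypothetical here, not the merged implementation):

```python
import torch
import flair

def select_word_embedding(all_hidden_states_in_lm, offset, i):
    # Pick the hidden states at one position (a view into the big tensor).
    embedding = all_hidden_states_in_lm[offset, i, :]
    # Only pay for the copy when it matters: during training, or when
    # embeddings are kept on the GPU, where a lingering view would pin
    # the whole hidden-state tensor in memory.
    if torch.is_grad_enabled() or getattr(flair, "embedding_storage_mode", None) == "gpu":
        embedding = embedding.clone()
    return embedding
```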
Thanks @alanakbik.
I think this is the beginning of an understanding; the view theory may be the right one: they share the same underlying storage. A little experiment:

```python
embedding = all_hidden_states_in_lm[offset, i, :]
print(embedding.is_set_to(all_hidden_states_in_lm))
print(embedding.data_ptr(), all_hidden_states_in_lm.data_ptr())
print(embedding.storage().data_ptr(), all_hidden_states_in_lm.storage().data_ptr())
print(embedding.storage()._weak_ref(), all_hidden_states_in_lm.storage()._weak_ref())
print(embedding.storage_offset(), all_hidden_states_in_lm.storage_offset())
```

It prints that the `_weak_ref` values are the same and the `.storage().data_ptr()` values are the same, but the `data_ptr()` values are not.
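The differing `data_ptr()` values are expected for a view: the pointer targets the view's first element, offset from the shared storage base. Continuing the snippet above (illustrative):

```python
# data_ptr() = storage().data_ptr() + storage_offset() * element_size(),
# so a view's data_ptr() differs while the storage pointer matches.
assert embedding.data_ptr() == (
    embedding.storage().data_ptr()
    + embedding.storage_offset() * embedding.element_size()
)
```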
Hi @pommedeterresautee yes that sounds good - I'll merge shortly! Thanks for sharing your findings! I guess because they share the storage, the big hidden-state tensor cannot be freed as long as the embedding view is alive.
👍 |
It appeared in the profiler (screenshots on #1070) that `Tensor.clone()` was one of the slowest single operations during inference. Looking at the code, it is always called, even when gradients are disabled.

My understanding is that when gradients are disabled, embeddings can't change, so there is no need to clone the tensor to protect it from being updated. That's what this commit does.

I have run tests on inference, but not on LM training; I don't think it changes anything there, but let me know if I missed something.
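In spirit, the change amounts to something like the following (a sketch under the assumptions above, not the exact diff; the helper name is illustrative):

```python
import torch

def extract_embedding(all_hidden_states_in_lm, offset, i):
    # Select the hidden states at one position (a view into the big tensor).
    embedding = all_hidden_states_in_lm[offset, i, :]
    if torch.is_grad_enabled():
        # During training, take an independent copy so the stored
        # embedding cannot be mutated through the shared buffer.
        embedding = embedding.clone()
    return embedding
```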
Timings are on a 2080 Ti with storage set to `None`, on my French dataset. FYI, I was previously measuring with storage set to `"cpu"`, which is why the time decreased a lot compared to PR #1068 (43 secs after optimization). Details on #1070.

This PR also adds a warning when the 'cpu' storage option is used with a GPU (see the sketch below), plus a documentation update.
Fixes #1070.
All evaluations were still using embedding storage set to 'cpu'; I moved them to 'none'.
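For the added warning, something along these lines (wording and placement are illustrative, not the exact code):

```python
import logging
import torch

log = logging.getLogger("flair")

def check_storage_mode(embeddings_storage_mode: str) -> None:
    # Illustrative check: storing embeddings on CPU while computing on a
    # GPU forces a device transfer for every batch.
    if embeddings_storage_mode == "cpu" and torch.cuda.is_available():
        log.warning(
            "Storage mode 'cpu' with a GPU adds costly device transfers; "
            "consider 'none' or 'gpu' if memory allows."
        )
```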