[Paper + tutorial] The Johnson-Lindenstrauss lemma & Linformer #7
-
Hi Teven, my name is Iacopo. Thanks for starting this discussion and for the really nice blog post :-) I'm not an expert in NLP, but some members of the ML team @lightonai are looking into what can be done with random matrices in language tasks.

At first glance, this looks similar to sketched matrix multiplication: finding a matrix S such that E[A^T S^T S B] = A^T B. If we remove the softmax from the picture in equation 8 of the paper, we get back to this, I believe. This would suggest that sharing the projections E and F is not detrimental, and maybe that a random S (or E, or F) is enough? I'm not sure whether having a different random matrix for each attention head would be beneficial or not. Note that if you choose S and its dimensions properly, and you have a fast way of computing products with S, then you can reduce the computation time by a lot! (we think we have a fast way to do this at LightOn, wink wink). For more about sketched matrix multiplication, see Sketching as a Tool for Numerical Linear Algebra by David Woodruff, section 2.2. I think it is also covered in the fast.ai course on numerical linear algebra.

About placing random matrices in other locations: there has been some recent work on random attention matrices in the Synthesizer paper https://arxiv.org/abs/2005.00743, and the LSH in Reformer: The Efficient Transformer https://arxiv.org/abs/2001.04451 is a random-matrix-based hashing scheme.

Really looking forward to seeing other people's comments!
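To make the sketched-matrix-multiplication idea concrete, here is a minimal numpy sketch (my own illustration, not code from the paper; the sizes and the shared low-rank structure of A and B are arbitrary choices so that A^T B carries real signal):

```python
# Sketched matrix multiplication with a Gaussian sketching matrix S whose k
# rows are scaled by 1/sqrt(k), so that E[S.T @ S] = I and therefore
# E[A.T @ S.T @ S @ B] = A.T @ B. The error shrinks roughly like 1/sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
n, d = 4_096, 64

# Two tall matrices whose columns share some common structure,
# so that the exact product A.T @ B is not just noise.
shared = rng.normal(size=(n, 8))
A = shared @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d))
B = shared @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d))

exact = A.T @ B                                # O(n d^2)

for k in (64, 256, 1_024):
    S = rng.normal(size=(k, n)) / np.sqrt(k)   # random sketching matrix, E[S.T @ S] = I_n
    approx = (S @ A).T @ (S @ B)               # = A.T @ S.T @ S @ B, O(k d^2) after the sketches
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(k, rel_err)                          # the relative error decreases as k grows
```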
-
Hello, I tried reimplementing this paper because I thought it would be "easy", and I learned a lot. I looked at the SST-2 dataset, and the max length after tokenization is 72 tokens, so I'm also not convinced that the long-sequence analysis (> 8000) is practical. No training results were reported, and it's not clear what level of training is required to get extreme sequence lengths to converge or perform well on downstream tasks.
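For reference, here is a minimal sketch of how one might check that max length with the `datasets` and `transformers` libraries (the choice of bert-base-uncased is just an example tokenizer):

```python
# Compute the maximum tokenized length over the SST-2 training split.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

max_len = max(len(tokenizer.encode(ex["sentence"])) for ex in dataset)
print(max_len)  # around 72 tokens, nowhere near the 8k+ regime analyzed in the paper
```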
-
Hey everyone and welcome to the Hugging Face reading group! We're trying out the new GitHub Discussions to share paper discussions with the community.
This week will be about Linformer, a very recent paper that breaks the quadratic complexity bottleneck of standard Transformers, and the Johnson-Lindenstrauss lemma, a key result in high-dimensional geometry that serves as a dimensionality-reduction powerhouse. I wrote a tutorial blog post: it's designed to give the reader intuition about the high-dimensional phenomenon of concentration of measure in order to introduce the JL lemma, then shows how Linformer uses this result to do hyper-efficient NLP.
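As a quick, concrete illustration of the JL lemma (my own toy example; the dimensions are arbitrary choices): project random high-dimensional points with a scaled Gaussian matrix and check that all pairwise distances are approximately preserved.

```python
# Minimal numpy/scipy sketch of the JL lemma: a random Gaussian projection
# from d = 10_000 down to k = 1_000 dimensions approximately preserves
# pairwise Euclidean distances between n = 50 points.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 1_000

X = rng.normal(size=(n, d))                # n points in dimension d
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled random projection matrix
Y = X @ P                                  # projected points in dimension k

ratios = pdist(Y) / pdist(X)               # distance after / distance before, for every pair
print(ratios.min(), ratios.max())          # all ratios should be close to 1
```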
We'd like to use this space to foster research discussions within the community. Here are a few questions we had after reading the paper, as a starting point. Feel free to use this thread or to open new ones to discuss them, or to start your own debates; we're figuring out the best format as we go :)