[Paper + tutorial] The Johnson-Lindenstrauss lemma & Linformer #7
-
Hi Teven, my name is Iacopo. Thanks for starting this discussion and for the really nice blog post :-) I'm not an expert in NLP, but some members of the ML team @lightonai are looking into what can be done with random matrices in language tasks.

At first glance, this looks similar to sketched matrix multiplication: finding a matrix S such that E[A^T S^T S B] = A^T B. If we remove the softmax from the picture in equation 8 of the paper, we get back to this, I believe. This would suggest that sharing the projections E and F is not detrimental, and maybe that a random S (or E, or F) is enough? I'm not sure whether having a different random matrix for each attention head would be beneficial or not. Note that if you choose S and its dimensions properly, and you have a fast way of computing products with S, then you can reduce the computation time by a lot! (we think we have a fast way to do this at LightOn, wink wink). For more about sketched matrix multiplication, see Sketching as a Tool for Numerical Linear Algebra by David Woodruff, section 2.2. I think it is also covered in the fast.ai course on numerical linear algebra.

About placing random matrices in other locations: there has been some recent work on random attention matrices in the Synthesizer paper https://arxiv.org/abs/2005.00743, and the LSH in Reformer: The Efficient Transformer https://arxiv.org/abs/2001.04451 is a random-matrix-based hashing scheme.

Really looking forward to seeing other people's comments!
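To make the sketched-matrix-multiplication idea concrete, here is a minimal numpy sketch (my own illustration, not code from the paper; the sizes and the shared low-rank structure of A and B are arbitrary choices so that A^T B carries real signal):

```python
# Sketched matrix multiplication with a Gaussian sketching matrix S whose k
# rows are scaled by 1/sqrt(k), so that E[S.T @ S] = I and therefore
# E[A.T @ S.T @ S @ B] = A.T @ B. The error shrinks roughly like 1/sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
n, d = 4_096, 64

# Two tall matrices whose columns share some common structure,
# so that the exact product A.T @ B is not just noise.
shared = rng.normal(size=(n, 8))
A = shared @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d))
B = shared @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d))

exact = A.T @ B                                # O(n d^2)

for k in (64, 256, 1_024):
    S = rng.normal(size=(k, n)) / np.sqrt(k)   # random sketching matrix, E[S.T @ S] = I_n
    approx = (S @ A).T @ (S @ B)               # = A.T @ S.T @ S @ B, O(k d^2) after the sketches
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(k, rel_err)                          # the relative error decreases as k grows
```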
-
Hello, I tried reimplementing this paper because I thought it would be "easy", and I learned a lot. I looked at the SST-2 dataset, and the max length after tokenization is 72 tokens, so I'm also not convinced that the long-sequence analysis (> 8000) is practical. No training results were reported, and it's not clear what level of training is required to get extreme sequence lengths to converge or perform well on downstream tasks.
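For reference, here is a minimal sketch of how one might check that max length with the `datasets` and `transformers` libraries (the choice of bert-base-uncased is just an example tokenizer):

```python
# Compute the maximum tokenized length over the SST-2 training split.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

max_len = max(len(tokenizer.encode(ex["sentence"])) for ex in dataset)
print(max_len)  # around 72 tokens, nowhere near the 8k+ regime analyzed in the paper
```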
-
Hey everyone and welcome to the Hugging Face reading group! We're trying out the new GitHub Discussions to share paper discussions with the community.
This week will be about Linformer, a very recent paper that breaks the quadratic complexity bottleneck of standard Transformers, and the Johnson-Lindenstrauss lemma, a key result in high-dimensional geometry that serves as a dimensionality-reduction powerhouse. I wrote a tutorial blog post: it's designed to give the reader intuition about the high-dimensional phenomenon of concentration of measure in order to introduce the JL lemma, then shows how Linformer uses this result to do hyper-efficient NLP.
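As a quick, concrete illustration of the JL lemma (my own toy example; the dimensions are arbitrary choices): project random high-dimensional points with a scaled Gaussian matrix and check that all pairwise distances are approximately preserved.

```python
# Minimal numpy/scipy sketch of the JL lemma: a random Gaussian projection
# from d = 10_000 down to k = 1_000 dimensions approximately preserves
# pairwise Euclidean distances between n = 50 points.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 1_000

X = rng.normal(size=(n, d))                # n points in dimension d
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled random projection matrix
Y = X @ P                                  # projected points in dimension k

ratios = pdist(Y) / pdist(X)               # distance after / distance before, for every pair
print(ratios.min(), ratios.max())          # all ratios should be close to 1
```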
We'd like to use this space to foster research discussions within the community. Here are a few questions we had after reading the paper, as a starting point. Feel free to use this thread or to open new ones to discuss them, or to start your own debates; we're figuring out the best format as we go :)