Question: Is Linformer permutation equivariant (set-operation)? #26
Comments
Hey @nmakes! I originally used the Linformer for a similar task (unstructured data). I wrote a report on it, and what I found was that it was about as effective as other sparse attention models. So I think it should work 🙂
Hey @tatp22, thanks for the answer. Interesting! Could you give a little more intuition on why you think it worked (what the task was, and whether there are any caveats)? :) I'm actually seeing a clear regression on my task, so your insights would be super useful. Thanks!
Hey @nmakes! My task was similar to yours. I had a 3D point cloud, with each sample representing the cloud at a different point in time and a feature vector of about 7 values per 3D point. The points were sometimes batched together along a dimension (for example, I would group 5 time steps along the x dimension) so that temporal information was integrated into the prediction. The task was to predict the future properties of this point cloud; the time steps were batched in one-hour intervals, and I had to predict 72 hours into the future and evaluate the results. Why do I think it worked? Because I believe the model learned the most important relations between points on its own. That's why I don't think it matters much what kind of data is fed into the model; more often than not, the model will find the regression on its own. Let me know if you have any more questions!
Hey @tatp22, thank you so much for the details! :) Q1: Just to clarify, did you apply attention for each point independently over its own 5 previous time steps, or was attention applied across points as well (e.g. Nx5 queries)? It does make sense to apply attention over past time steps for each point independently in your example, where the task is to predict future time steps for that particular point. But, referring to my earlier question, I'm trying to understand why Linformer attention would work on unordered points. Here's a small experiment I did. TL;DR: changing the order of the points changes the outputs of the transformer.
Suppose we have a point cloud with five 3D points:
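Here's a minimal sketch of the setup (the constructor arguments follow this repo's README as far as I can tell; the exact hyperparameters are just placeholders):

```python
import torch
from linformer_pytorch import Linformer

torch.manual_seed(0)

# Batch of 1, a "sequence" of 5 points, 3 features (x, y, z): shape (1, 5, 3)
x = torch.randn(1, 5, 3)

# A tiny Linformer: input_size is the sequence length (number of points),
# channels is the per-point feature size, and dim_k < n turns on the
# n -> k downsampling that this question is about.
model = Linformer(input_size=5, channels=3, dim_k=2, dim_d=8, dim_ff=8,
                  nhead=1, depth=1)
model.eval()  # disable dropout so the comparison below is deterministic

with torch.no_grad():
    y = model(x)  # shape (1, 5, 3)
```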
Now, we swap the 0th and 4th index points in x:
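(Continuing the sketch above; under permutation equivariance, the permuted input should give the permuted output.)

```python
# Swap the 0th and 4th points
x_perm = x.clone()
x_perm[:, [0, 4]] = x[:, [4, 0]]

with torch.no_grad():
    y_perm = model(x_perm)

# If the model were permutation equivariant, y_perm would simply be y with
# rows 0 and 4 swapped.
y_expected = y.clone()
y_expected[:, [0, 4]] = y[:, [4, 0]]

print(torch.allclose(y_perm, y_expected))        # expected: False
print((y_perm[:, 1:4] - y[:, 1:4]).abs().max())  # rows 1-3 change too, even though they were never moved
```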
Note that we only swapped the first and last points. The point cloud is the same set, yet passing it through the transformer changes the features even for the points that were not swapped (idx=1 to idx=3).
This is why I'm finding it a little hard to understand how to make Linformer work for unstructured data. Q2: Did you mean that even under such behavior, Linformer is expected to improve the representations for the task? If so, how do we handle inference, where the ordering can be random (different results for the same scene depending on how the input is fed in each time)? PS: With the same code, setting
Ah, ok, I understand your points now. To answer: I did the second, Nx5 version, so there were a lot of points! As you can probably guess, normal attention would be too big, so I resorted to sparse attention, and it helped me there.

Q1: See #15 for more information about this. TL;DR: yes, the internal downsampling does scatter the data around, so this property is not guaranteed. I'm not sure whether it would work for your task, but have you tried encoding positional data into the model? Perhaps with my other repository, https://github.com/tatp22/multidim-positional-encoding? 🙂 That said, I think achieving this equivariance property is hard, if not impossible, with linear attention, because whatever downsampling method you choose, some information is necessarily lost. What's nice about full attention is that every point is compared with every other point, which is why equivariance is possible there. Unless you keep that guarantee with linear attention (which this repo doesn't, due to the downsampling), it is gone. (PS: try setting k=n. You might get equivariance then, depending on the sampling method!)

Q2: Yes, it should! I think the power here comes from the fact that there are so many parameters in the model that the Linformer learns the relationships anyway, with the Q and V matrices holding redundant information. Even if you feed the points in a different order during training, the model should still be powerful enough to see the relationships, due to the sheer number of parameters. I hope this helps!
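A quick way to check the k=n suggestion above (a sketch continuing the earlier experiment; `full_attention` is the flag the README lists for falling back to ordinary O(n²) attention, and the other arguments are the same placeholders as before):

```python
# Same permutation test, but with the sequence-dimension downsampling disabled:
# either set dim_k equal to the number of points (k = n), or bypass the
# projection entirely with full_attention=True.
model_full = Linformer(input_size=5, channels=3, dim_k=5, dim_d=8, dim_ff=8,
                       nhead=1, depth=1, full_attention=True)
model_full.eval()

with torch.no_grad():
    y_full = model_full(x)
    y_full_perm = model_full(x_perm)

y_full_expected = y_full.clone()
y_full_expected[:, [0, 4]] = y_full[:, [4, 0]]

# With full attention and no positional encoding added, each layer is a
# per-point map plus attention over the whole set, so the permuted input
# should give the permuted output (up to floating-point noise).
print(torch.allclose(y_full_perm, y_full_expected, atol=1e-6))  # expected: True
```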
Hi. Thanks for the wonderful implementation!
I was wondering if Linformer can be used with any unordered set of tensors, or whether it is meant only for sequence data. Specifically, is Linformer permutation equivariant?
I'm looking to apply linear attention to points in 3D space (e.g. a point cloud with ~100k points). Would Linformer attention be meaningful there?
(I'm concerned about the n -> k projection, which, if I understand correctly, assumes the n points are in some fixed order.)
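To make the concern concrete, this is how I read the projection in the paper: each head computes roughly

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(E_i K W_i^K)^{\top}}{\sqrt{d_k}}\right) F_i V W_i^V,$$

where $E_i, F_i \in \mathbb{R}^{k \times n}$ are learned matrices that mix along the sequence dimension $n$. Each of the $k$ projected rows is a fixed linear combination of positions rather than a symmetric function of the set, which is why I'd expect permuting the points to change the result.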
Thanks!