Hi, thanks for creating this script, amazing work! I was wondering if you have any plans to create a conversion script for T5-based models, or whether you think there are any major difficulties in converting T5 models compared to other architectures.
Thanks,
David
T5 is planned at some point, but there are some caveats:
T5 relies on a relative positional embedding that is added directly to the attention score matrix, so you have to compute both Q @ K.T and a relative positional score matrix, which is inefficient for very long sequences. This is not the case for BART/Pegasus models, which use absolute positional embeddings added to the inputs instead.
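For context, here is a minimal sketch of that scoring step (my own illustration, not the actual LSG or HuggingFace implementation; tensor names are made up):

```python
import torch

def t5_style_scores(q, k, position_bias):
    """q, k: (batch, heads, seq_len, head_dim); position_bias: (1, heads, seq_len, seq_len)."""
    # Standard content-based scores.
    scores = torch.matmul(q, k.transpose(-1, -2))  # (batch, heads, seq_len, seq_len)
    # T5 adds a learned relative positional bias of the same shape before softmax,
    # so a full (seq_len x seq_len) bias matrix has to be materialized alongside
    # Q @ K.T -- the extra cost mentioned above for very long sequences.
    return scores + position_bias
```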
While the relative positional score matrix is not that difficult to compute for local attention, it is not compatible with most LSG sparse attention patterns. There are also no specific rules for how to assign relative positions to the global tokens that are prepended.
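To illustrate that last point, here is a rough sketch (again my own illustration, not repository code) of why relative positions are easy to define inside a local window but ambiguous for prepended global tokens:

```python
import torch

def local_relative_positions(seq_len: int, window: int):
    # Inside a sliding local window, key_pos - query_pos stays in [-window, window],
    # so only a small range of T5 relative-position buckets is ever needed.
    q_pos = torch.arange(seq_len)[:, None]
    k_pos = torch.arange(seq_len)[None, :]
    rel = k_pos - q_pos              # (seq_len, seq_len) relative distances
    in_window = rel.abs() <= window  # positions a local pattern actually attends to
    return rel, in_window

# Prepended global tokens sit outside the normal token ordering, so there is no
# natural "relative distance" between a global token and a regular token; some
# convention (e.g. a dedicated bucket) would have to be invented for LSG-T5.
```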
I'd say that LSG-T5 is much more difficult to build because I have to rethink some things specifically for this model.
If you really need to use T5 right now, there is the LongT5 model on HuggingFace; it is somewhat similar to LSG but less efficient. It is pretrained from scratch, so it is not based on an existing "short" T5 checkpoint.
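As a concrete starting point, LongT5 checkpoints can be loaded directly with transformers; the checkpoint name below is one of the publicly released ones and is only meant as an example:

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

# "google/long-t5-tglobal-base" is a LongT5 variant with transient-global attention;
# swap in another released checkpoint or a fine-tuned one as needed.
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained("google/long-t5-tglobal-base")

inputs = tokenizer("summarize: " + "some very long document ...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```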
Thank you! That's very enlightening. I guess for now using the existing LongT5 checkpoints and training on top of them is the only viable option, rather than converting already-trained T5 models into LSG.