
Sparse Attention

Here are some resources about Sparse Attention.

(Figure: Examples of Sparse Attention Patterns)

Intro

While some approaches have introduced heuristics for achieving locality and hierarchical structure within self-attention, another direction explores the sparsity patterns inherent in full attention matrices.

These methods introduce a sparse attention mask, denoted $M_{\mathcal{S}}$, where each row $i$ is assigned a sparse set of indices $\mathcal{S}_i \subseteq \lbrace j \mid j < i \rbrace$ that the $i$-th token attends to. Such sparsity-based attention mechanisms offer both computational efficiency and the ability to capture global context. The figure above visualizes several of these sparse attention patterns.
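Below is a minimal sketch, in PyTorch, of how such a mask $M_{\mathcal{S}}$ can be built and applied. The fixed pattern shown combines a causal sliding window with strided positions, loosely in the spirit of Child et al.'s Sparse Transformers; the window and stride sizes, the function names, and the inclusion of self-attention (so every row has at least one allowed key) are illustrative assumptions rather than details taken from any of the cited implementations.

```python
# Minimal sketch of a fixed sparse attention pattern (illustrative, not from any cited codebase).
import torch
import torch.nn.functional as F

def build_sparse_mask(seq_len: int, window: int = 4, stride: int = 4) -> torch.Tensor:
    """Boolean mask where mask[i, j] = True iff token i may attend to token j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (L, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions,   shape (1, L)
    causal = j <= i                          # allow only past (and current) tokens
    local = (i - j) < window                 # causal sliding-window locality
    strided = (j % stride) == 0              # periodic "global" columns for long-range context
    return causal & (local | strided)

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention; disallowed pairs are set to -inf before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 16, 8
q = k = v = torch.randn(seq_len, d)
mask = build_sparse_mask(seq_len)
out = sparse_attention(q, k, v, mask)
print(mask.int())   # visualize the sparsity pattern
print(out.shape)    # torch.Size([16, 8])
```

Swapping `build_sparse_mask` for a content-dependent routine (e.g., clustering or learned importance scores) would correspond to the adaptive sparsity patterns listed below, while block- and graph-structured variants underlie the graph sparsification approaches.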

Table of Contents

- Fixed Sparsity Patterns
- Adaptive Sparsity Patterns
- Graph Sparsification

Fixed Sparsity Patterns

Longnet: Scaling transformers to 1,000,000,000 tokens

paper link: here

citation:

@article{ding2023longnet,
  title={Longnet: Scaling transformers to 1,000,000,000 tokens},
  author={Ding, Jiayu and Ma, Shuming and Dong, Li and Zhang, Xingxing and Huang, Shaohan and Wang, Wenhui and Wei, Furu},
  journal={arXiv preprint arXiv:2307.02486},
  year={2023}
}

DeepSpeed Sparse Attention: Powering 10x longer sequences with 6x faster execution

blog link: here

citation:

@misc{microsoft2020deepspeed,
  author = {Microsoft},
  title = {DeepSpeed Sparse Attention: Powering 10x longer sequences with 6x faster execution},
  year = {2020},
  howpublished = {\url{https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/}},
}

Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting (LogSparse)

paper link: here

citation:

@article{li2019enhancing,
  title={Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting},
  author={Li, Shiyang and Jin, Xiaoyong and Xuan, Yao and Zhou, Xiyou and Chen, Wenhu and Wang, Yu-Xiang and Yan, Xifeng},
  journal={Advances in neural information processing systems},
  volume={32},
  year={2019}
}

Generating long sequences with sparse transformers

paper link: here

citation:

@article{child2019generating,
  title={Generating long sequences with sparse transformers},
  author={Child, Rewon and Gray, Scott and Radford, Alec and Sutskever, Ilya},
  journal={arXiv preprint arXiv:1904.10509},
  year={2019}
}

Adaptive Sparsity Patterns

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

paper link: here

github link: here

citation:

@misc{jiang2024minference10acceleratingprefilling,
      title={MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention}, 
      author={Huiqiang Jiang and Yucheng Li and Chengruidong Zhang and Qianhui Wu and Xufang Luo and Surin Ahn and Zhenhua Han and Amir H. Abdi and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu},
      year={2024},
      eprint={2407.02490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.02490}, 
}

Sparsebert: Rethinking the importance analysis in self-attention

paper link: here

citation:

@inproceedings{shi2021sparsebert,
  title={Sparsebert: Rethinking the importance analysis in self-attention},
  author={Shi, Han and Gao, Jiahui and Ren, Xiaozhe and Xu, Hang and Liang, Xiaodan and Li, Zhenguo and Kwok, James Tin-Yau},
  booktitle={International Conference on Machine Learning},
  pages={9547--9557},
  year={2021},
  organization={PMLR}
}

Not all memories are created equal: Learning to forget by expiring (Expire-Span)

paper link: here

citation:

@inproceedings{sukhbaatar2021not,
  title={Not all memories are created equal: Learning to forget by expiring},
  author={Sukhbaatar, Sainbayar and Ju, Da and Poff, Spencer and Roller, Stephen and Szlam, Arthur and Weston, Jason and Fan, Angela},
  booktitle={International Conference on Machine Learning},
  pages={9902--9912},
  year={2021},
  organization={PMLR}
}

Efficient content-based sparse attention with routing transformers

paper link: here

citation:

@article{roy2021efficient,
  title={Efficient content-based sparse attention with routing transformers},
  author={Roy, Aurko and Saffar, Mohammad and Vaswani, Ashish and Grangier, David},
  journal={Transactions of the Association for Computational Linguistics},
  volume={9},
  pages={53--68},
  year={2021},
  publisher={MIT Press}
}

Graph Sparsification

Big bird: Transformers for longer sequences

paper link: here

citation:

@article{zaheer2020big,
  title={Big bird: Transformers for longer sequences},
  author={Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={17283--17297},
  year={2020}
}

Star-transformer

paper link: here

citation:

@article{guo2019star,
  title={Star-transformer},
  author={Guo, Qipeng and Qiu, Xipeng and Liu, Pengfei and Shao, Yunfan and Xue, Xiangyang and Zhang, Zheng},
  journal={arXiv preprint arXiv:1902.09113},
  year={2019}
}