
Hierarchical Attention

Here are some resources about Hierarchical Attention.

Intro

Taking a further view of both the global-token techniques and the inter-block attention mentioned in Local Attention, we can regard them as introducing hierarchical features into self-attention: the higher-level attention compensates with more global information, while the lower-level local attention keeps the computational cost low. From this view, many works have explored hierarchical mechanisms that build an explicit hierarchy into self-attention, leveraging higher-level global information and lower-level local attention together for multi-scale contextual receptive fields.
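Below is a minimal, illustrative PyTorch sketch of the two-level idea described above: tokens attend locally within fixed-size blocks, each block is pooled into a single summary token, the summaries attend to each other globally, and the resulting global context is broadcast back to the tokens. The names (`TwoLevelHierarchicalAttention`, `block_size`), the mean-pooling step, and the additive combination are illustrative assumptions, not the method of any specific paper listed here.

```python
# Sketch of two-level hierarchical attention (illustrative, not a specific paper's method).
import torch
import torch.nn as nn


class TwoLevelHierarchicalAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, block_size=16):
        super().__init__()
        self.block_size = block_size
        # low level: attention restricted to tokens inside the same block
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # high level: attention over one summary token per block
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, d_model); seq_len assumed divisible by block_size here
        b, n, d = x.shape
        nb = n // self.block_size

        # (1) local attention inside each block: cost O(n * block_size) instead of O(n^2)
        blocks = x.reshape(b * nb, self.block_size, d)
        local, _ = self.local_attn(blocks, blocks, blocks)

        # (2) summarize each block into one higher-level token (mean pooling),
        #     then let the block summaries attend to each other globally
        summaries = local.mean(dim=1).reshape(b, nb, d)
        global_ctx, _ = self.global_attn(summaries, summaries, summaries)

        # (3) broadcast each block's global context back to its tokens
        global_ctx = global_ctx.reshape(b * nb, 1, d).expand(-1, self.block_size, -1)
        return (local + global_ctx).reshape(b, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 64, 64)                  # batch=2, seq_len=64, d_model=64
    y = TwoLevelHierarchicalAttention()(x)
    print(y.shape)                              # torch.Size([2, 64, 64])
```

The multi-level variants collected further below (e.g., binary-partitioning or hierarchical-matrix approaches) generalize this idea by stacking more than one summary level, so distant context is covered at progressively coarser granularity.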

Table of Contents

Two-Level Hierarchy | Multi-Level Hierarchy

Two-Level Hierarchy

Hegel: Hypergraph transformer for long document summarization

paper link: here

citation:

@article{zhang2022hegel,
  title={Hegel: Hypergraph transformer for long document summarization},
  author={Zhang, Haopeng and Liu, Xiao and Zhang, Jiawei},
  journal={arXiv preprint arXiv:2210.04126},
  year={2022}
}

Hierarchical learning for generation with long source sequences

paper link: here

citation:

@article{rohde2021hierarchical,
  title={Hierarchical learning for generation with long source sequences},
  author={Rohde, Tobias and Wu, Xiaoxia and Liu, Yinhan},
  journal={arXiv preprint arXiv:2104.07545},
  year={2021}
}

Lite transformer with long-short range attention

paper link: here

citation:

@article{wu2020lite,
  title={Lite transformer with long-short range attention},
  author={Wu, Zhanghao and Liu, Zhijian and Lin, Ji and Lin, Yujun and Han, Song},
  journal={arXiv preprint arXiv:2004.11886},
  year={2020}
}

Hierarchical transformers for long document classification (HAN)

paper link: here

citation:

@inproceedings{pappagari2019hierarchical,
  title={Hierarchical transformers for long document classification},
  author={Pappagari, Raghavendra and Zelasko, Piotr and Villalba, Jes{\'u}s and Carmiel, Yishay and Dehak, Najim},
  booktitle={2019 IEEE automatic speech recognition and understanding workshop (ASRU)},
  pages={838--844},
  year={2019},
  organization={IEEE}
}

HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization

paper link: here

citation:

@article{zhang2019hibert,
  title={HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization},
  author={Zhang, Xingxing and Wei, Furu and Zhou, Ming},
  journal={arXiv preprint arXiv:1905.06566},
  year={2019}
}

Document-level neural machine translation with hierarchical attention networks

paper link: here

citation:

@article{miculicich2018document,
  title={Document-level neural machine translation with hierarchical attention networks},
  author={Miculicich, Lesly and Ram, Dhananjay and Pappas, Nikolaos and Henderson, James},
  journal={arXiv preprint arXiv:1809.01576},
  year={2018}
}

A discourse-aware attention model for abstractive summarization of long documents

paper link: here

citation:

@article{cohan2018discourse,
  title={A discourse-aware attention model for abstractive summarization of long documents},
  author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
  journal={arXiv preprint arXiv:1804.05685},
  year={2018}
}

Multi-Level Hierarchy

Combiner: Full attention transformer with sparse computation cost

paper link: here

citation:

@article{ren2021combiner,
  title={Combiner: Full attention transformer with sparse computation cost},
  author={Ren, Hongyu and Dai, Hanjun and Dai, Zihang and Yang, Mengjiao and Leskovec, Jure and Schuurmans, Dale and Dai, Bo},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={22470--22482},
  year={2021}
}

H-transformer-1d: Fast one-dimensional hierarchical attention for sequences

paper link: here

citation:

@article{zhu2021h,
  title={H-transformer-1d: Fast one-dimensional hierarchical attention for sequences},
  author={Zhu, Zhenhai and Soricut, Radu},
  journal={arXiv preprint arXiv:2107.11906},
  year={2021}
}

Bp-transformer: Modelling long-range context via binary partitioning (BPT)

paper link: here

citation:

@article{ye2019bp,
  title={Bp-transformer: Modelling long-range context via binary partitioning},
  author={Ye, Zihao and Guo, Qipeng and Gan, Quan and Qiu, Xipeng and Zhang, Zheng},
  journal={arXiv preprint arXiv:1911.04070},
  year={2019}
}

Adaptive attention span in transformers

paper link: here

citation:

@article{sukhbaatar2019adaptive,
  title={Adaptive attention span in transformers},
  author={Sukhbaatar, Sainbayar and Grave, Edouard and Bojanowski, Piotr and Joulin, Armand},
  journal={arXiv preprint arXiv:1905.07799},
  year={2019}
}