This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), along with the PDF files and GROBID extractions of those PDFs.
- We provide PDFs, full text, references, and other details extracted by GROBID from the PDFs, whereas the ACL Anthology itself only provides abstracts.
- A similar corpus called the ACL Anthology Network exists, but it is now showing its age, with just 23k papers as of December 2016.
The goal is to keep this corpus updated and provide a comprehensive repository of the full ACL collection.
This repository provides data for 80,013 ACL articles/posters:
- 📖 All PDFs in the ACL Anthology : size 45G download here
- 🎓 All bib files in the ACL Anthology, with abstracts : size 172M download here
- 🏷️ Raw GROBID extraction results for all ACL Anthology PDFs, including full text and references : size 3.6G download here
- 💾 Dataframe with metadata and full text of the collection, for analysis : size 503M download here
>>> import pandas as pd
>>> pd.read_parquet('acl_corpus_full-text.parquet')
acl_id title abstract full_text
0 P83-1025 Discourse P... This paper ... This paper ...
1 P93-1015 Parsing Fre... There is a ... There is a ...
2 P82-1017 REFLECTIONS... Our society...
3 C86-1129 A Lexical F... This paper ... This paper ...
4 C80-1093 AUTHOR INDE...
.. ... ... ... ...
95 2022.hcinlp... Introductio...
96 2021.tacl-1.9 On the Rela... We use larg... We use larg...
97 2022.nlp4co... Understandi... Exemplar-ba... Exemplar-ba...
98 2022.nlp4co... Stylistic R... Personality... Personality...
99 2022.nlp4co... Toward Know... Conversatio... Conversatio...
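As the sample output suggests, some rows have no abstract or no full text. The following is a minimal sketch of how that coverage could be checked. It assumes the four columns shown above and assumes that missing values are stored as empty strings or nulls. The helper name `summarize_coverage` is hypothetical and not part of this repository.

```python
import pandas as pd


def summarize_coverage(df: pd.DataFrame) -> dict:
    """Count how many rows carry a non-empty abstract and full text.

    Assumption: missing values appear as nulls or as empty/whitespace strings.
    """
    def non_empty(column: str) -> pd.Series:
        # Treat nulls as empty strings, then flag rows with visible content.
        return df[column].fillna("").astype(str).str.strip().ne("")

    return {
        "papers": len(df),
        "with_abstract": int(non_empty("abstract").sum()),
        "with_full_text": int(non_empty("full_text").sum()),
    }
```

With the parquet file downloaded, this could be called as `summarize_coverage(pd.read_parquet('acl_corpus_full-text.parquet'))`.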
The provided ACL ID is also consistent with the Semantic Scholar (S2) API:
https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025
The API can be used to fetch more information for each paper in the corpus.
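A lookup of this kind might be sketched as follows, using only the Python standard library. The function names `s2_paper_url` and `fetch_s2_paper` are hypothetical helpers. The `fields` query parameter and the example field names (`title`, `citationCount`) reflect our understanding of the S2 Graph API and should be checked against its documentation. Rate limits and error handling are not covered here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example URL above.
S2_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/"


def s2_paper_url(acl_id, fields=None):
    """Build the S2 lookup URL for an ACL ID such as 'P83-1025'."""
    url = S2_PAPER_ENDPOINT + "ACL:" + urllib.parse.quote(acl_id, safe="")
    if fields:
        # Assumption: the API accepts a comma-separated `fields` parameter.
        url += "?" + urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    return url


def fetch_s2_paper(acl_id, fields=None, timeout=30):
    """Fetch the paper record as a dict (performs a network request)."""
    with urllib.request.urlopen(s2_paper_url(acl_id, fields), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_s2_paper("P83-1025", fields=["title", "citationCount"])` would request those two fields for the first paper in the sample above.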
TODO:
- Link the ACL corpus to Semantic Scholar (S2) and to sources like S2ORC.
- Extract figures and captions from the ACL corpus using pdffigures.
- Have a release schedule to keep the corpus updated.
- Build an ACL citation graph.
- Enhance metadata with a bib file mapping, including authors.
- Add citation counts for papers.
We hope this corpus will be helpful for analyses relevant to the ACL community.
Please cite/star 🌟 this page if you use this corpus.
The ACL Anthology corpus is released under the CC BY-NC 4.0 license. By using this corpus, you are agreeing to its usage terms.