README

(scisumm-corpus @ https://github.com/WING-NUS/scisumm-corpus)

This package contains a release of training and test topics to aid in the development of computational linguistics summarization systems.

Final system reports of the most recent edition of the Shared Task, CL-SciSumm '18 @ SIGIR '18, can be found in the BIRNDL Proceedings at http://ceur-ws.org/Vol-2132/ under the header 'System Papers'.

The CL-SciSumm Shared Task is run off the CL-SciSumm corpus, and comprises three sub-tasks in automatic research paper summarization on a new corpus of research papers. A training corpus of forty topics has been released. A test corpus of ten topics will be released. The topics comprise ACL Computational Linguistics research papers, their citing papers, and three output summaries each. The three output summaries are: the traditional self-summary of the paper (the abstract), the community summary (the collection of citation sentences ‘citances’) and a human summary written by a trained annotator. Within the corpus, each citance is also mapped to its referenced text in the reference paper and tagged with the information facet it represents. We plan to further enrich this dataset with the AAN metafeatures and other meta-descriptors developed by researchers at DERI, National University of Ireland.

For more details, see the Contents section at the bottom of this README. To learn how this corpus was constructed, please see ./docs/corpusconstruction.txt

Results of CL-SciSumm-18 will be released at the BIRNDL workshop collocated with ACM SIGIR 2018, Ann Arbor, MI, USA. See the task website for details.

If you use the data and publish, please let us know and cite our CL-SciSumm 2016 task overview paper:
@inproceedings{jaidka2016overview,
title={Overview of the CL-SciSumm 2016 Shared Task},
author={Jaidka, Kokil and Chandrasekaran, Muthu Kumar and Rustagi, Sajal and Kan, Min-Yen},
booktitle={In Proceedings of Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2016)},
year={2016}
}

README for The 4th Computational Linguistics Scientific Document Summarization Shared Task Corpus (CL-SciSumm 2018)

March 18, 2018

Please read further for details on the Computational Linguistics Shared Task run as part of the BIRNDL 2018 workshop collocated with SIGIR 2018. The official website is hosted at: http://wing.comp.nus.edu.sg/~cl-scisumm2018

Final system reports of the Shared Task can be found in the BIRNDL Proceedings at http://ceur-ws.org/Vol-2132/ under the header 'System Papers'.

Overview

You are invited to participate in the CL-SciSumm Shared Task at BIRNDL 2018. The shared task will be on automatic paper summarization in the Computational Linguistics (CL) domain. The output summaries will be of two types: faceted summaries of the traditional self-summary (the abstract) and the community summary (the collection of citation sentences ‘citances’). We also propose to group the citances by the facets of the text that they refer to.

This task follows up on the successful previous editions at SIGIR 2017, JCDL 2016 and the Pilot Task conducted as a part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014). It follows the basic structure and guidelines of the Biomedical Summarization Track and adapts them for annotating and creating a corpus of training topics from computational linguistics research papers.
The task is defined as follows:

Given: A topic consisting of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified.

Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).
Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.
Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.
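
The two size limits stated above (a cited text span of no more than 5 consecutive sentences, and a summary of no more than 250 words) can be checked before submission. The following is a minimal sketch, not part of the official tooling. It assumes that a cited text span is represented as a list of integer sentence IDs and that words are counted by splitting on whitespace; the function names are hypothetical.

```python
def is_valid_cited_span(sentence_ids, max_sentences=5):
    """Return True if the IDs form one run of consecutive sentences
    containing at most max_sentences sentences.

    Assumption: spans are given as integer sentence IDs in the RP.
    """
    ids = sorted(set(sentence_ids))
    if not ids or len(ids) > max_sentences:
        return False
    # Consecutive IDs span exactly len(ids) - 1 steps from first to last.
    return ids[-1] - ids[0] == len(ids) - 1


def within_word_limit(summary, max_words=250):
    """Return True if the summary has at most max_words words.

    Assumption: words are whitespace-separated tokens.
    """
    return len(summary.split()) <= max_words
```

For example, `is_valid_cited_span([3, 4, 5])` accepts a three-sentence span, while `is_valid_cited_span([3, 5])` rejects a span with a gap.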

Evaluation: Task 1 will be scored by overlap of text spans, measured by the number of sentences in the system output vs. the gold standard. Task 2 will be scored using the ROUGE family of metrics between i) the system output and the gold standard summary from the reference spans, and ii) the system output and the abstract of the reference paper. Again, Task 2 is optional.
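
To illustrate the sentence-overlap idea behind Task 1 scoring, here is a minimal sketch that compares the sentence IDs chosen by a system with those in the gold standard. It is not the official scorer: the use of precision, recall and F1, the representation of spans as integer sentence IDs, and the function name are all assumptions made for this example.

```python
def sentence_overlap_scores(system_ids, gold_ids):
    """Return (precision, recall, f1) over sets of sentence IDs.

    Assumption: each cited text span is reduced to the set of RP
    sentence IDs it covers; the official scoring may differ.
    """
    system, gold = set(system_ids), set(gold_ids)
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A system that picks sentences 10-12 when the gold span is 11-14
# shares two sentences with the gold standard.
p, r, f = sentence_overlap_scores([10, 11, 12], [11, 12, 13, 14])
```

Here two of the three system sentences are correct (precision 2/3) and two of the four gold sentences are found (recall 1/2).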

Contents

This directory contains the source document for the RP of the topic in xml format in UTF-8 character encoding. The file corresponds to the similarly named pdf file in Documents_PDF/. All annotations and offsets for the topic are with respect to the xml files in this directory. All the files were created from the pdf file using Adobe Acrobat.
Note that there were OCR errors in reading several of the files, and the annotators often had to manually edit the converted txt files. Research groups using the corpus are free to use alternative parsing tools on the pdfs provided, if they are found to perform better.

./data/???-????_TRAIN/CITANCE_XML/

This directory contains the source document for the CPs of the topic in xml format in UTF-8 character encoding. Each file corresponds to the similarly named pdf file above.

./data/???-????_TRAIN/Annotation/

This directory contains the annotation files for the topic, from 3 different annotators.
Please DO NOT use older annotations; only use .annv3.txt for the 2016 Shared Task.

./data/???-????_TRAIN/summary/

The summary task (Task 2) is an optional "bonus" task which participants may want to attempt. This directory contains two kinds of summaries: i. the abstract, and ii. the reference spans (not the citances, but the information they reference in the source paper). Both are extractive summaries. For the development sets we will release in April, we will include a third type of summary: hand-written annotator summaries. These will be abstractive.
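
The directory layout described above can be walked with a short script. The following is a minimal sketch that only relies on the paths named in this README (the ./data/???-????_TRAIN/ pattern and the CITANCE_XML, Annotation and summary sub-directories). The function name list_topics is hypothetical, and nothing is assumed about the names or formats of the files inside each directory.

```python
from pathlib import Path

# Sub-directories named in this README; others may exist in the corpus.
SUBDIRS = ("CITANCE_XML", "Annotation", "summary")


def list_topics(data_root):
    """Map each topic directory name to the files found in its sub-directories."""
    topics = {}
    # "?" matches exactly one character, mirroring ./data/???-????_TRAIN/
    for topic_dir in sorted(Path(data_root).glob("???-????_TRAIN")):
        if not topic_dir.is_dir():
            continue
        files = {}
        for sub in SUBDIRS:
            sub_dir = topic_dir / sub
            # A missing sub-directory is reported as an empty list.
            files[sub] = (
                sorted(p.name for p in sub_dir.iterdir() if p.is_file())
                if sub_dir.is_dir()
                else []
            )
        topics[topic_dir.name] = files
    return topics
```

Calling `list_topics("./data")` from the repository root would return one entry per training topic, which makes it easy to spot topics with missing annotation or summary files.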

Annotation

Given a reference paper (RP) and 10 or more citing papers (CPs), annotators from the University of Hyderabad were instructed to find citations to the RP in the CPs. Annotators followed the instructions in SciSumm-annotation-guidelines.pdf to mark the Citation Text, Citation Marker, Reference Text, and Discourse Facet for each citation of the RP found in the CP.

Organisers' Contacts

For further information about this data release, contact the following members of the BIRNDL 2017 workshop organising committee:

Kokil Jaidka (University of Pennsylvania) kokil.j@gmail.com
Muthu Kumar Chandrasekaran (Dept. of Computer Science, School of Computing, National University of Singapore) cmkumar087@gmail.com
Michihiro Yasunaga (Computer Science, Yale University) michihiro.yasunaga@yale.edu
Dragomir Radev (Computer Science, Yale University) dragomir.radev@yale.edu
Min-Yen Kan (Dept. of Computer Science, School of Computing, National University of Singapore) kanmy@comp.nus.edu.sg

This README was updated from README2017 by Muthu Kumar Chandrasekaran in March, 2018. For revision information, check source code control logs.
