This repository contains the companion material for the following publication:
Annemarie Friedrich, Heike Adel, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Marusczyk and Lukas Lange. The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain. ACL 2020.
Please cite this paper if using the dataset or the code, and direct any questions to Lukas Lange. The paper can be found in the ACL Anthology or on arXiv.
This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.
The SOFC-Exp corpus contains 45 scientific publications about solid oxide fuel cells (SOFCs), published between 2013 and 2019 as open-access articles, all under a CC-BY license. The dataset was manually annotated by domain experts with the following information:
- Mentions of relevant experiments have been marked using a graph structure corresponding to instances of an Experiment frame (similar to the frames used in FrameNet). We assume that an Experiment frame is introduced to the discourse by mentions of words such as *report*, *test* or *measure* (also called the frame-evoking elements). The nodes corresponding to these tokens are the heads of the graphs representing the Experiment frames.
- The Experiment frame related to SOFC experiments defines a set of 16 possible participant slots. Participants are annotated as dependents of the frame-evoking element, i.e., as links between the frame-evoking element and the participant node.
- In addition, we provide coarse-grained entity/concept types for all frame participants, i.e., MATERIAL, VALUE or DEVICE. Note that this annotation has not been performed on the full texts, but only on sentences containing information about relevant experiments, plus a few additional sentences. In the paper, we run experiments for both tasks only on the set of sentences marked as experiment-describing in the gold standard, which is admittedly a slightly simplified setting. Entity types are only partially annotated on other sentences. Slot filling could of course also be evaluated in a fully automatic setting, with automatic experiment sentence detection as a first step.
For further information on the annotation scheme, please refer to our paper and the annotation guidelines.
Each article is indexed by its PMC ID (see PubMed Central).
The `sofc-exp-corpus` directory containing the manually annotated corpus is structured as follows:
```
sofc-exp-corpus/
    annotations/
        sentences                  Binary information per sentence: does it describe an experiment?
        entity_types_and_slots     Entity type and slot annotations in BIO format, by sentence
                                   (for experiment-describing sentences only!)
        frames                     Full frame-style annotations
        tokens                     Character-offset-based stand-off annotations for tokenized text
                                   (created with Stanford CoreNLP)
    texts/                         Raw texts as extracted from the PDFs, one sentence per line
    docs/
        annotation_guidelines.pdf
        SOFC-Exp-Metadata.csv      Additional metadata / references and links to original documents
```
The folders `sentences` and `entity_types_and_slots` contain information derived from the original `frames` annotations, as used in our ACL 2020 paper.
- `sentences` contains one file per article, with each line describing one sentence: `sentence_id, label, begin_char_offset, end_char_offset`. The binary label (0 or 1) indicates whether the sentence describes an SOFC-related experiment. We simply consider all sentences containing at least one frame-evoking element to be experiment-describing (label 1). The character offsets giving the start and end of each sentence refer to the respective files in the `texts` directory (see the parsing sketch after this list). Note that each line in these files usually contains exactly one sentence, but in several cases a sentence annotation spans a line break. Hence, always use the sentence annotations given here when working with the other annotation levels! Sentence splitting was performed with Java's built-in `BreakIterator.getSentenceInstance` with US locale.
- `tokens` contains stand-off annotations for the tokens in the original texts. Tokenization was done with Stanford CoreNLP. The file format is: `sentence_id, token_id, begin_char_offset, end_char_offset`. Columns are tab-separated. Character offsets always refer to the start of the respective sentence. Counts for `sentence_id` and `token_id` start at 1.
- `entity_types_and_slots` contains the entity type and experiment slot information for the subset of sentences that describe SOFC-related experiments, i.e., the sentences labeled with 1 above. This information can also be extracted from the `frames` files; we provide it here for convenience. The file format is: `sentence_id, token_id, begin_char_offset, end_char_offset, entity_label, slot_label`. The `entity_label` and `slot_label` columns use a BIO format. Columns are tab-separated.
- The full frame annotation is represented in `frames`, one tab-separated file per article. First, the file lists all annotated text spans in lines prefixed with `SPAN`. The second column is the span ID; the third is its entity/concept type label, or `EXPERIMENT` followed by a colon and a more specific experiment mention type. The fourth column is the sentence ID (line of the sentence in the text file, counts start at 1), and the fifth and sixth columns give the character offsets of the begin and end of the span within the sentence. The last column shows the text corresponding to the span (for debugging/readability purposes).

```
...
SPAN  43  MATERIAL                  19  34   37   SFM
SPAN  44  EXPERIMENT:previous_work  19  43   52   displayed
SPAN  45  MATERIAL                  19  120  123  air
SPAN  46  VALUE                     19  125  136  860 S cm−1
SPAN  47  MATERIAL                  19  142  150  hydrogen
SPAN  48  VALUE                     19  152  162  48 S cm−1
SPAN  49  VALUE                     19  180  191  400600 oC9
...
```
- After the spans, the file lists the experiment frame instances. Frame instances start with `EXPERIMENT`, followed by an experiment ID in the second column and the span ID of the corresponding frame-evoking element in the third column. The following lines, in which the first column is empty, list the slots of the frame: the second column gives the label of the frame participant (slot), and the last column gives the span ID of the dependent/slot filler.

```
...
EXPERIMENT  10  44
    anode_material       43
    fuel_used            45
    conductivity         46
    fuel_used            47
    conductivity         48
    working_temperature  49
...
```
- After the experiment frame instances, the file lists the additional link annotations available with our corpus. Each of these lines starts with `LINK`, giving the label of the relation in the second column and two span IDs referring to the start and end span, respectively. Note that links labeled `same_experiment` and `experiment_variation` are annotated here as links between experiment-evoking mentions, but conceptually they indicate links between the respective `EXPERIMENT`s.

```
...
LINK  thickness             66   65
...
LINK  experiment_variation  98   94
LINK  same_experiment       100  103
LINK  coreference           2    3
LINK  coreference           12   17
...
```
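The stand-off formats above are straightforward to read programmatically. Below is a minimal Python sketch, not part of the released code, that loads a `sentences` file together with its raw text and collects the `SPAN`, `EXPERIMENT` and `LINK` records from a `frames` file. File paths and helper names are illustrative, and tab-separated columns are assumed throughout, as in the `tokens` and `entity_types_and_slots` files.

```python
import csv

def read_sentences(ann_path, text_path):
    """Return (sentence_id, label, sentence_text) triples.

    Offsets in the sentences files refer to the corresponding raw text
    file in texts/ (a sentence annotation may span a line break).
    """
    raw_text = open(text_path, encoding="utf-8").read()
    sentences = []
    with open(ann_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            sent_id, label, begin, end = row[0], row[1], row[2], row[3]
            sentences.append((int(sent_id), int(label), raw_text[int(begin):int(end)]))
    return sentences

def read_frames(frames_path):
    """Collect the SPAN, EXPERIMENT and LINK records of one frames file."""
    spans, experiments, links = {}, [], []
    current = None  # experiment frame currently being read
    with open(frames_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            if row[0] == "SPAN":
                # span_id -> (type_label, sentence_id, begin, end, surface_text)
                spans[row[1]] = (row[2], int(row[3]), int(row[4]), int(row[5]), row[6])
            elif row[0] == "EXPERIMENT":
                current = {"experiment_id": row[1], "evoking_span": row[2], "slots": []}
                experiments.append(current)
            elif row[0] == "LINK":
                links.append((row[1], row[2], row[3]))
            elif row[0] == "" and current is not None:
                # slot line: first column empty, then slot label and filler span ID
                current["slots"].append((row[1], row[2]))
    return spans, experiments, links
```

Experiment participants can then be resolved by looking up each slot's span ID in the `spans` dictionary.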
We ran our experiments using Python 3.8. You need the conda packages `torch`, `numpy` and `scikit-learn`, as well as the pip packages `transformers` (by Hugging Face), `gensim` (version < 4) and `sentencepiece`.
See also the exported conda environment (`sofcexp.yml` at the top level of the project), which can be instantiated with `conda env create -f sofcexp.yml`.
Edit 12/21: The code has been fixed to work with PyTorch 1.8+.
Download the pretrained word2vec, mat2vec and bpe embeddings and place them in `data/embeddings`. If you prefer a different storage location, update the values of the command-line parameters `embedding_file_word2vec`, `embedding_file_mat2vec` and `embedding_file_bpe` in `main_preprocess.py` accordingly.
word2vec and bpe embeddings are expected in .bin format; for mat2vec embeddings, you will need the whole content of the folder `mat2vec/training/models/pretrained_embeddings` from the mat2vec project.
The bpe embeddings must have a vocab size of 200,000 and dimensionality 300. Furthermore, the bpe model with vocab size 200,000 is required and must also be placed in `data/embeddings`.
Run the script `main_preprocess.py`. It will reduce the embeddings to the corpus vocabulary and create word-to-embedding-index files. The reduced embeddings will be stored as .npy files, the word2index files as pickle files. The default storage location is again `data/embeddings`, but this can be changed via command-line arguments of `main_preprocess.py`.
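For orientation, loading the preprocessed resources then looks roughly as follows; the file names below are hypothetical, check the output of `main_preprocess.py` for the actual ones.

```python
import pickle
import numpy as np

# Illustrative file names only -- the actual names are determined by
# main_preprocess.py.
embeddings = np.load("data/embeddings/word2vec_reduced.npy")   # (vocab_size, dim) matrix
with open("data/embeddings/word2vec_word2index.pkl", "rb") as f:
    word2index = pickle.load(f)                                # token -> row index

vector = embeddings[word2index["anode"]]  # embedding of one corpus word
```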
Place the PyTorch SciBERT model into `models/SciBERT/scibert_scivocab_uncased`.
Make sure this directory contains the files `config.json`, `pytorch_model.bin` and `vocab.txt`.
If you are using a different BERT model, adapt the value of the parameter `pretrained_bert`; see `main.py`.
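As a quick sanity check that the directory is complete, the model should load with the standard Hugging Face transformers API. This is independent of our training code, which reads the path from `pretrained_bert`:

```python
from transformers import BertModel, BertTokenizer

# Load tokenizer and encoder from the local SciBERT directory.
path = "models/SciBERT/scibert_scivocab_uncased"
tokenizer = BertTokenizer.from_pretrained(path)
model = BertModel.from_pretrained(path)
```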
See the scripts in the `scripts` folder for configurations replicating our ACL 2020 experiments.
After your jobs for the individual runs of the CV folds have finished, run
`python source/evaluation/evaluate_cross_validation.py`
with appropriate command-line parameters (see this file for the arguments) to collect the predictions from the runs and compute aggregate statistics.
The code in this repository is open-sourced under the AGPL-3.0 license. See the LICENSE file for details. For a list of other open-source components included in this project, see the file `code/3rd-party-licenses.txt`.
The manual annotations created for the SOFC-Exp corpus, located in the folder `sofc-exp-corpus/annotations`, are licensed under a Creative Commons Attribution 4.0 International License (CC-BY-4.0).