A cross-parallel-dataset approach to frame classification at variable granularity levels

With this code, you can reproduce the results of the paper "A cross-parallel-dataset approach to frame classification at variable granularity levels". This file explains how to set up and use the application.

Set up

We recommend using Python 3.8.

Required Libraries

Natural Language Toolkit: pip install nltk (we used version 3.4.5)
- The code downloads the NLTK data it needs by itself.
TensorFlow (GPU use is optional): pip install tensorflow (we used version 2.3.0)
Logging: pip install loguru (we used version 0.4.1)
Word Mover's Distance implementation: pip install word-mover-distance (we used version 0.0.1)
Plots: pip install matplotlib (we used version 3.2.2)

And basic libraries:

pip install numpy (we used version 1.19.1, which is installed together with tensorflow; if CPU/GPU errors occur, try pip install --upgrade tensorflow)
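To compare an existing environment against the versions listed above, a small check along the following lines may help. This is only a sketch: the dictionary PINNED and the function check_versions are hypothetical helpers and not part of the repository; the version numbers are the ones stated in this file.

```python
from importlib import metadata

# Versions reported in this ReadMe. The dict is an illustration, not a repository file.
PINNED = {
    "nltk": "3.4.5",
    "tensorflow": "2.3.0",
    "loguru": "0.4.1",
    "word-mover-distance": "0.0.1",
    "matplotlib": "3.2.2",
    "numpy": "1.19.1",
}


def check_versions(pinned):
    """Return {package: status}; status is 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == expected:
            report[name] = "ok"
        else:
            report[name] = f"installed {installed}, expected {expected}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(PINNED).items():
        print(f"{package}: {status}")
```

A mismatch is not necessarily a problem; the listed versions are simply the ones the results were produced with.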

Datasets

We use two datasets.

The Webis-Argument-Dataset

See this homepage

Yamen Ajjour, Milad Alshomary, Henning Wachsmuth, and Benno Stein. Modeling Frames in Argumentation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP 2019), pages 2922-2932, November 2019. ACL.

The Media-Frames-Dataset

See this GitHub-reference

Card, Dallas, et al. "The media frames corpus: Annotations of frames across issues." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015.

Unfortunately, we can't provide the dataset in this repository due to license issues.

Important files for evaluation

The workflow has two steps: first the dataset is created, then the evaluation runs on it. Scripts in the folder Corpora convert the raw datasets into a generalized format; the converted files are marked with _out.csv. A crucial script is UserFrames2GenericFrames.py.

There are two main files for evaluation.

`BiLSTMApproach.py`

This file is for the single setup. Please execute the file with python3 BiLSTMApproach.py

`BiLSTMApproach_MultiTask.py`

This file is for the cross-parallel-dataset setup. Please execute the file with python3 BiLSTMApproach_MultiTask.py

`TrainGenereicFrames_ALBERT2.py`

We also tried modern pre-trained transformers, using ALBERT from the Hugging Face library. However, this did not lead to acceptable results, so we do not consider this option further.

Important parameters

In addition to the comment block at the beginning of each file, we present the important parameters here.

Predict the right output

We implemented several modes for predicting an output. We present the modes now.

Predict the embedded user label (token by token)

This mode is activated with using_word_embedding_output = True. In this mode, we predict max_seq_len_output vectors of dimension embedding_size_output, which represent the tokenized frame label embedded with the pre-computed word embeddings defined by word_embeddings_output.
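The shape of such a target can be illustrated with a minimal sketch. The function embed_label, the whitespace tokenizer, the zero vector for padding and unknown tokens, and the truncation of overly long labels are all assumptions made for illustration; the repository may handle these cases differently.

```python
def embed_label(label, embeddings, max_seq_len_output, embedding_size_output):
    """Turn a frame label into max_seq_len_output vectors of size embedding_size_output."""
    zero = [0.0] * embedding_size_output
    # Assumption: lowercase whitespace tokenization, truncated to the output length.
    tokens = label.lower().split()[:max_seq_len_output]
    # Assumption: tokens without an embedding are represented by the zero vector.
    vectors = [list(embeddings.get(token, zero)) for token in tokens]
    # Assumption: shorter labels are padded with zero vectors.
    vectors += [list(zero) for _ in range(max_seq_len_output - len(vectors))]
    return vectors


# Toy 3-dimensional embeddings, invented for this example.
toy_embeddings = {"economic": [0.9, 0.1, 0.0], "growth": [0.7, 0.3, 0.1]}
target = embed_label("Economic growth", toy_embeddings,
                     max_seq_len_output=4, embedding_size_output=3)
```

Here target is a list of four 3-dimensional vectors: two for the label tokens and two padding vectors.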

Predict the Frames-set-classes

To this end, frames is set to a GenericFrame instance from Frames.py. In this mode, we predict the generic frame. If a label does not exactly match a frame of the frame set, the sample is discarded if filter_unknown_frames = True and labeled with the unknown frame otherwise. This mode is not recommended in combination with the Webis dataset.
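The handling of non-matching labels described above can be sketched as follows. The function map_to_frame_set, the constant UNKNOWN, and the representation of samples as (text, label) pairs are hypothetical and chosen only to illustrate the two behaviours.

```python
UNKNOWN = "unknown"  # placeholder name for the unknown frame (assumption)


def map_to_frame_set(samples, frame_set, filter_unknown_frames):
    """Keep exact matches; discard or relabel everything else."""
    kept = []
    for text, label in samples:
        if label in frame_set:
            kept.append((text, label))
        elif not filter_unknown_frames:
            # Label is not in the frame set: keep the sample under the unknown frame.
            kept.append((text, UNKNOWN))
        # Otherwise the sample is discarded.
    return kept


frames = {"economic", "morality"}
samples = [("argument a", "economic"), ("argument b", "taxes")]
filtered = map_to_frame_set(samples, frames, filter_unknown_frames=True)
relabeled = map_to_frame_set(samples, frames, filter_unknown_frames=False)
```

With filtering enabled only the first sample survives; with filtering disabled the second sample is kept and labeled with the unknown frame.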

Predict the mapped Frames-set-classes

To this end, set frames=None and define one_hot_output_clusters with a GenericFrame instance from Frames.py. The output vector is determined by the Word Mover's Distance.
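To give an intuition for a distance-based mapping, the sketch below assigns a user label to the closest generic frame. It deliberately uses a simplified stand-in, the relaxed Word Mover's Distance (every token travels to its nearest counterpart), and not the word-mover-distance package the repository uses. The functions relaxed_wmd and nearest_frame, the toy embeddings, and the frame descriptions are assumptions for illustration; requires Python 3.8 for math.dist.

```python
import math


def relaxed_wmd(tokens_a, tokens_b, embeddings):
    """Relaxed Word Mover's Distance, a lower bound on the exact distance."""
    def one_way(source, destination):
        # Each source token moves entirely to its nearest destination token.
        return sum(
            min(math.dist(embeddings[s], embeddings[d]) for d in destination)
            for s in source
        ) / len(source)

    return max(one_way(tokens_a, tokens_b), one_way(tokens_b, tokens_a))


def nearest_frame(label_tokens, generic_frames, embeddings):
    """Return the generic frame with the smallest distance to the label."""
    distances = {
        name: relaxed_wmd(label_tokens, frame_tokens, embeddings)
        for name, frame_tokens in generic_frames.items()
    }
    return min(distances, key=distances.get), distances


# Toy 2-dimensional embeddings, invented for this example.
toy_embeddings = {
    "money": [1.0, 0.0], "taxes": [0.8, 0.0], "economy": [0.9, 0.1],
    "ethics": [0.0, 1.0], "morality": [0.1, 0.9],
}
generic_frames = {"economic": ["economy"], "morality": ["morality"]}
best, distances = nearest_frame(["money", "taxes"], generic_frames, toy_embeddings)
```

In this toy setting the label "money taxes" ends up closest to the economic frame. How the repository turns the distances into the final output vector is not specified here.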

Predict the right cluster (k-means-algorithm by nltk)

To this end, set frames=None and define one_hot_output_clusters with an integer representing k.

The semantic clustering is activated by default.

Further parameters for both settings (single and cross-parallel-dataset)

In the cross-parallel-dataset setting, the following parameters are applied to each task's input and output.

`max_seq_len`

In our paper, we described a fixed length for each input (premise + conclusion). This parameter defines that length. Shorter inputs are padded; longer inputs are discarded.
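The pad-or-discard rule can be sketched in a few lines. The function fit_to_length and the padding token are hypothetical; the repository may pad with a different symbol or on a different side.

```python
PAD = "<pad>"  # assumed padding token


def fit_to_length(token_lists, max_seq_len, pad_token=PAD):
    """Pad shorter inputs to max_seq_len and drop longer ones."""
    fitted = []
    for tokens in token_lists:
        if len(tokens) > max_seq_len:
            continue  # longer inputs are discarded
        fitted.append(tokens + [pad_token] * (max_seq_len - len(tokens)))
    return fitted


inputs = [["taxes", "hurt"], ["this", "input", "is", "too", "long"]]
fitted = fit_to_length(inputs, max_seq_len=3)
```

Note that a small max_seq_len reduces the number of training samples, because every longer input is dropped rather than truncated.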

`enable_fuzzy_framing`

Setting this boolean flag to True enables fuzzy framing, which disables the one-hot encoding of the target. For example, consider an input that belongs 80% to the first frame class and 20% to the second one.

If enable_fuzzy_framing is:

True: we want to predict [0.8 0.2 ...]
False: we want to predict [1 0 ...]
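The two target encodings from the example can be written as a minimal sketch. The function encode_target is hypothetical, and picking the class with the highest membership for the one-hot case is an assumption.

```python
def encode_target(memberships, enable_fuzzy_framing):
    """Return the soft membership vector or its one-hot counterpart."""
    if enable_fuzzy_framing:
        return list(memberships)
    # Assumption: the one-hot target marks the class with the highest membership.
    best = max(range(len(memberships)), key=memberships.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(memberships))]


memberships = [0.8, 0.2, 0.0]
fuzzy_target = encode_target(memberships, enable_fuzzy_framing=True)
one_hot_target = encode_target(memberships, enable_fuzzy_framing=False)
```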

`using_topic`

A boolean flag that controls whether the topic is included in the input to the learning model.

`using_premise`

A boolean flag that controls whether the premise is included in the input to the learning model.

`using_conclusion`

A boolean flag that controls whether the conclusion is included in the input to the learning model.
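How the three flags using_topic, using_premise and using_conclusion could shape the model input is sketched below. The function build_input, the order of the parts, and joining them with a space are assumptions for illustration only.

```python
def build_input(topic, premise, conclusion,
                using_topic, using_premise, using_conclusion):
    """Concatenate only the parts of an argument that are switched on."""
    parts = []
    if using_topic:
        parts.append(topic)
    if using_premise:
        parts.append(premise)
    if using_conclusion:
        parts.append(conclusion)
    return " ".join(parts)


text = build_input(
    topic="minimum wage",
    premise="higher wages raise consumer spending",
    conclusion="the minimum wage should be raised",
    using_topic=False, using_premise=True, using_conclusion=True,
)
```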

`filter_unknown_frames`

A boolean flag that activates post-filtering. Normally, we filter during dataset creation. However, this variable can be used to filter out frames that do not occur in the defined generic frame class set.

`word_embeddings` + `embedding_size`

These parameters define the pre-computed word embeddings to use. The embeddings must be stored in a txt file. The embedding size is an integer giving the dimensionality of the word embeddings.

We recommend using the GloVe word embeddings.

`NN_which_used`

We offer three neural net architectures:

BiLSTM: a bidirectional neural net which has LSTMs as core layer
BiGRU: a bidirectional neural net which has GRUs as core layer
CNN: a convolutional neural net (without recurrent layers)

`data_set`

The dataset to use.

Further parameters for the cross-parallel-dataset setting

The other meaning of `frames`

In the single-setting, frames activates the strict Frame-set-classes. However, in the Multi-Task-setting, frames defines the output classes for the Media-Frames-Task and does not influence the output for the Webis-task.

`give_webis_extra_layers`

This boolean flag controls the architecture on the Webis-task-side. If True, Webis gets some additional layers. To be more specific, Webis gets

an additional Dropout-layer with a rate of 0.66
an additional Dense-layer

`soft_parameter_sharing_lambda`

This parameter expects a float in the range from 0 (exclusive) to 1 (inclusive). It controls the parameter sharing mode:

soft_parameter_sharing_lambda < 1: soft-parameter-sharing
soft_parameter_sharing_lambda = 1: hard-parameter-sharing
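The value range and the resulting mode can be checked with a small helper. The function sharing_mode is hypothetical and only mirrors the rule stated above; it does not show how the sharing itself is implemented in the model.

```python
def sharing_mode(soft_parameter_sharing_lambda):
    """Validate the range (0, 1] and name the resulting sharing mode."""
    if not 0 < soft_parameter_sharing_lambda <= 1:
        raise ValueError("soft_parameter_sharing_lambda must be in (0, 1]")
    return "hard" if soft_parameter_sharing_lambda == 1 else "soft"


mode_half = sharing_mode(0.5)
mode_one = sharing_mode(1.0)
```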
