This project provides deepa2, which
- 🥚 takes NLP data (e.g. NLI, argument mining) as ingredients;
- 🎂 bakes DeepA2 datasets conforming to the Deep Argument Analysis Framework;
- 🍰 serves DeepA2 data as text2text datasets suitable for training language models.
There's a public collection of 🎂 DeepA2 datasets baked with deepa2 on the HF hub.
The Documentation describes usage options and gives background info on the Deep Argument Analysis Framework.
- Install deepa2 into your ML project's virtual environment, e.g.:
source my-projects-venv/bin/activate
python --version # should be ^3.7
python -m pip install deepa2
- Add the deepa2 preprocessor to your training pipeline. Your training script may look like this, for example:
#!/bin/bash
# configure and activate environment
...
# download deepa2 datasets and
# prepare for text2text training
# (input <<< 🎂 some-deepa2-dataset, output >>> 🍰 t2t)
deepa2 serve \
  --path some-deepa2-dataset \
  --export_format csv \
  --export_path t2t
# run default training script,
# e.g., with 🤗 Transformers
# (input <<< 🍰 t2t/train.csv)
python .../run_summarization.py \
  --train_file t2t/train.csv \
  --text_column "text" \
  --summary_column "target" \
  --...
# clean-up
rm -r t2t
- That's it.
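Before starting a training run, it can help to check what the exported file actually contains. The following is a minimal sketch, not part of deepa2: it assumes the export is a CSV file at t2t/train.csv with "text" and "target" columns, as the --text_column and --summary_column flags in the script above suggest. The helper name preview_t2t is hypothetical.

```python
import csv
from pathlib import Path


def preview_t2t(csv_path, n=3):
    """Return the first n (text, target) pairs from an exported CSV file."""
    pairs = []
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Assumption: the export has "text" and "target" columns, mirroring
        # the --text_column / --summary_column flags of the training script.
        missing = {"text", "target"} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing expected columns: {sorted(missing)}")
        for row in reader:
            if len(pairs) >= n:
                break
            pairs.append((row["text"], row["target"]))
    return pairs


if __name__ == "__main__":
    train_file = Path("t2t/train.csv")  # path used in the script above
    if train_file.exists():
        for text, target in preview_t2t(train_file):
            # Truncate long fields so the preview stays readable.
            print(repr(text[:80]), "->", repr(target[:80]))
```

Run it after deepa2 serve and before the clean-up step, since the script above removes the t2t folder at the end.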
Install poetry.
Clone the repository:
git clone https://github.com/debatelab/deepa2-datasets.git
Install this package from within the repo's root folder:
poetry install
Bake a DeepA2 dataset, e.g.:
# input <<< 🥚 esnli, output >>> 🎂 ./data/processed
poetry run deepa2 bake \
  --name esnli \
  --debug-size 100 \
  --export-path ./data/processed
We welcome contributions to this repository, especially scripts that port existing datasets to the DeepA2 Framework. Within this repo, a code module that transforms data into the DeepA2 format contains:
- a Builder class that describes how DeepA2 examples will be constructed and that implements the abstract builder.Builder interface (e.g., builder.entailmentbank_builder.EnBankBuilder);
- a DataLoader which provides a method for loading the raw data as a 🤗 Dataset object (e.g., builder.entailmentbank_builder.EnBankLoader); you may use deepa2.DataLoader as is if the data is available in a form compatible with 🤗 Dataset;
- dataclasses which describe the features of the raw data and the preprocessed data, and which extend the dummy classes deepa2.RawExample and deepa2.PreprocessedExample;
- a collection of unit tests that check the concrete Builder's methods (e.g., tests/test_enbank.py);
- documentation of the pipeline (as, for example, in docs/esnli.md).
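To make the division of labour between these parts more concrete, here is a rough, self-contained sketch of how the dataclasses and a builder might fit together. It does not import deepa2 and does not reproduce its actual interfaces: the classes below are local stand-ins, and the method names preprocess and build, the Toy* classes, and the output keys are all hypothetical. Consult builder.Builder and the existing builders in the repository for the real method signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


# Local stand-ins for illustration only. They are NOT imported from deepa2
# and do not reproduce its actual class definitions.
@dataclass
class RawExample:
    """Dummy base class for the features of the raw data."""


@dataclass
class PreprocessedExample:
    """Dummy base class for the features of the preprocessed data."""


@dataclass
class ToyRawExample(RawExample):
    premise: str
    hypothesis: str


@dataclass
class ToyPreprocessedExample(PreprocessedExample):
    premises: List[str]
    conclusion: str


class Builder(ABC):
    """Hypothetical, simplified builder interface (not deepa2's)."""

    @abstractmethod
    def preprocess(self, raw: RawExample) -> PreprocessedExample:
        """Turn a raw record into a cleaned, typed example."""

    @abstractmethod
    def build(self, example: PreprocessedExample) -> Dict[str, str]:
        """Construct one output item from a preprocessed example."""


class ToyBuilder(Builder):
    def preprocess(self, raw: ToyRawExample) -> ToyPreprocessedExample:
        # Strip surrounding whitespace and wrap the single premise in a list.
        return ToyPreprocessedExample(
            premises=[raw.premise.strip()],
            conclusion=raw.hypothesis.strip(),
        )

    def build(self, example: ToyPreprocessedExample) -> Dict[str, str]:
        # Hypothetical output keys, chosen for this sketch only.
        return {
            "source_text": " ".join(example.premises + [example.conclusion]),
            "conclusion": example.conclusion,
        }
```

A unit test for such a builder would then feed a small raw example through each method and assert on the result, for instance checking that ToyBuilder().preprocess(ToyRawExample(" It rains. ", "The street is wet.")).premises equals ["It rains."].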
If you would like to construct such a pipeline collaboratively, consider opening a new issue.
This repository builds on and extends the DeepA2 Framework originally presented in:
@article{betz2021deepa2,
title={DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models},
author={Gregor Betz and Kyle Richardson},
year={2021},
eprint={2110.01509},
archivePrefix={arXiv},
primaryClass={cs.CL}
}