murattasdemir/CANarEx

 
 

CANarEx pipeline

Runs on Linux and macOS using Python 3.9.5

Factiva and Hansard 'First Nations' dataset

  • Set up the CaNarEx environment:

     cd CaNarEx
     python3 -m venv venv_canarex
     source venv_canarex/bin/activate
     pip install -r requirements.txt

Step1: Split data into sentences

  • Use CaNarEx environment
  • Run 1.split_sentences_trf.py (the data is already provided):
        python 1.split_sentences_trf.py
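
As a rough illustration of what this step produces, the sketch below turns a document into one record per sentence using a simple regular expression. It is a stand-in, not the logic of 1.split_sentences_trf.py, which is not shown here. The function name `split_sentences` and the `doc_id` / `sent_id` / `sentence` fields are hypothetical.

```python
import re

# Naive boundary: sentence-ending punctuation, then whitespace,
# then an uppercase letter or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def split_sentences(doc_id, text):
    """Return one record per sentence, keeping the source document id."""
    parts = [p.strip() for p in _BOUNDARY.split(text.strip())]
    return [
        {"doc_id": doc_id, "sent_id": i, "sentence": s}
        for i, s in enumerate(p for p in parts if p)
    ]


if __name__ == "__main__":
    sample = "The bill passed. Members debated it for hours! Was it fair?"
    for row in split_sentences("d1", sample):
        print(row)
```

A regex splitter like this breaks on abbreviations such as "Mr. Smith", which is one reason a trained model is usually preferred for this step.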

Step2: Coreference resolution

Using SpanBERT

  • Download https://github.com/mandarjoshi90/coref and follow the installation instructions in "Jonathan K. Kummerfeld's notebook" (using 'spanbert_base') to set it up in a separate coref_env environment
  • Install the following packages into coref_env:
        pip install tokenization
        pip install sacremoses
  • Run coreference resolution:
        python 2.coref_bert.py
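
To make the goal of this step concrete, the sketch below shows one simple way predicted coreference clusters could be applied to text: every later mention in a cluster is replaced by the cluster's first mention. The input format (word tokens plus clusters of inclusive `(start, end)` token spans) and the function name `resolve_coreferences` are assumptions for illustration; 2.coref_bert.py may represent and apply clusters differently.

```python
def resolve_coreferences(tokens, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.

    tokens   -- list of word strings
    clusters -- list of clusters; each cluster is a list of (start, end)
                token spans with an inclusive end index (assumed format)
    Nested or overlapping mentions are not handled in this sketch.
    """
    replacements = {}  # start index -> (end index, replacement tokens)
    for cluster in clusters:
        spans = sorted(cluster)
        first_start, first_end = spans[0]
        head = tokens[first_start:first_end + 1]
        for start, end in spans[1:]:
            replacements[start] = (end, head)

    resolved, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)  # substitute the representative mention
            i = end + 1            # skip the tokens of the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


if __name__ == "__main__":
    words = "The minister said she would review the policy .".split()
    print(" ".join(resolve_coreferences(words, [[(0, 1), (3, 3)]])))
```

Resolving pronouns before role labelling means that a later step sees "The minister" as the agent rather than an uninformative "she".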

Step3: SRL extraction

  • Use CaNarEx environment
  • Run 3.run_canarex.py:
        python 3.run_canarex.py
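
For readers unfamiliar with semantic role labelling (SRL), the sketch below shows how labelled frames can be reduced to (agent, verb, patient) narrative triples. The frame layout (a dict keyed by PropBank-style role names `ARG0`, `V`, `ARG1`) and the function name `frames_to_narratives` are hypothetical; the actual output format of 3.run_canarex.py is not described here.

```python
def frames_to_narratives(frames):
    """Keep frames that have an agent, a verb and a patient.

    frames -- list of dicts mapping role names to text spans (assumed format)
    Returns a list of lowercase (agent, verb, patient) triples.
    """
    narratives = []
    for frame in frames:
        agent = frame.get("ARG0")
        verb = frame.get("V")
        patient = frame.get("ARG1")
        if agent and verb and patient:
            narratives.append((agent.lower(), verb.lower(), patient.lower()))
    return narratives


if __name__ == "__main__":
    example = [
        {"ARG0": "The government", "V": "funds", "ARG1": "housing"},
        {"V": "rained"},  # no agent or patient, so it is dropped
    ]
    print(frames_to_narratives(example))
```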

Step4: Filtering narratives

TopN clustering (document-level clustering) and TextRank clustering

    python 4.clustering.py
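
As background on TextRank, the sketch below ranks short texts by running PageRank over a word-overlap similarity graph and keeps the highest-scoring ones. It is a generic, standard-library illustration; how 4.clustering.py builds its graph, which similarity it uses, and how it combines ranking with clustering are not stated here. The names `textrank` and `top_n`, the damping factor and the iteration count are assumptions.

```python
import math


def _similarity(a, b):
    """Word-overlap similarity in the style of TextRank: |A & B| / (log|A| + log|B|)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank(texts, damping=0.85, iterations=50):
    """Return one PageRank-style score per text."""
    n = len(texts)
    if n == 0:
        return []
    weights = [
        [_similarity(texts[i], texts[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def top_n(texts, n=2):
    """Return the n highest-scoring texts."""
    scores = textrank(texts)
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in order[:n]]
```

Texts that share vocabulary with many others accumulate score, while an isolated text keeps only the baseline `(1 - damping) / n`, so it falls to the bottom of the ranking.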

Evaluation

The evaluation folder contains Jupyter notebook code that generates synthetic test data for narrative time-series clustering.
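
To show what synthetic test data for time-series clustering can look like, the sketch below draws noisy series around a few known shapes, so that a clustering method can be scored against the known labels. It is not the generator used in the evaluation folder; the three prototype shapes, the parameter names and the function `make_synthetic_series` are all hypothetical.

```python
import math
import random


def make_synthetic_series(n_per_cluster=3, length=24, noise=0.1, seed=0):
    """Return (series, labels): noisy copies of three prototype shapes."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    prototypes = {
        "rising": lambda t: t / (length - 1),
        "falling": lambda t: 1 - t / (length - 1),
        "seasonal": lambda t: 0.5 + 0.5 * math.sin(2 * math.pi * t / 12),
    }
    series, labels = [], []
    for label, shape in prototypes.items():
        for _ in range(n_per_cluster):
            series.append([shape(t) + rng.gauss(0, noise) for t in range(length)])
            labels.append(label)
    return series, labels


if __name__ == "__main__":
    data, truth = make_synthetic_series()
    print(len(data), "series of length", len(data[0]))
    print(truth)
```

Because the true label of every series is known, the output of a clustering run can be compared directly with `labels`.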

Reference (Baseline: Relatio)

    python 5.run_relatio.py
