This is a preprocessor/data-cleaner for the WebNLG dataset.

zhijing-jin/WebNLG_Reader


This is an easy-to-use Python reader for the enriched WebNLG data.

How to run

python data/webnlg/reader.py [--version x.x]

--version choices: 1.4 | 1.5 (default)

The resulting file structure looks like this:

.
├── data
│   └── webnlg
│       ├── reader.py
│       ├── utils.py
│       ├── raw/
│       ├── test.json
│       ├── train.json
│       └── valid.json
└── README.md

Contributions

  1. Decomposed the WebNLG dataset from document level into sentence level.
  2. Created an easy-to-use Python reader for WebNLG dataset v1.5, runnable as of 2019-SEP-20. (Debugged and adapted from the reader in chimera's repo.)
  3. Manually fixed spaCy's sentence tokenization.
  4. Deleted parts of sentences for which no corresponding triple exists.
  5. Manually deleted irrelevant triples.
  6. Manually fixed all wrong templates (e.g. template.replace('AEGNT-1', 'AGENT-1')), making them convenient for template-based models.
  7. Carefully replaced - with _ in template names, e.g. AGENT-1 to AGENT_1, which makes tokenization more convenient.
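The template fixes in items 6-7 can be sketched roughly as follows. The function name and the typo table are illustrative; the repo's actual clean-up code may differ.

```python
import re

# Known misspellings -> canonical role names (illustrative example).
TYPO_FIXES = {"AEGNT": "AGENT"}

def fix_template(template):
    # 1. Repair misspelled role names, e.g. AEGNT-1 -> AGENT-1.
    for typo, fix in TYPO_FIXES.items():
        template = template.replace(typo, fix)
    # 2. Replace '-' with '_' inside role placeholders (AGENT-1 -> AGENT_1),
    #    so tokenizers keep each placeholder as a single token.
    return re.sub(r"\b(AGENT|PATIENT|BRIDGE)-(\d+)\b", r"\1_\2", template)

print(fix_template("AEGNT-1 was born in PATIENT-1."))
# -> AGENT_1 was born in PATIENT_1.
```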

Overview of dataset

  • Dataset sizes: train 24526, valid 3019, test 6622
  • Vocab of entities: 3227
  • Vocab of ner: 12 (['agent_1', 'bridge_1', 'bridge_2', 'bridge_3', 'bridge_4', 'patient_1', 'patient_2', 'patient_3', 'patient_4', 'patient_5', 'patient_6', 'patient_7'])
  • Vocab of relations: 726
  • Vocab of txt: 6671
  • Vocab of tgt: 1897
  • Len(tgt): avg 11.5, max 42

Todo

  • "was selected by NASA" is a relationship which spans several words, -- it should be made as one word in the triple.
  • "(workedAt," is a relationship which has punctuations, -- it should be clean.
  • There are still several hundred dirty, unaligned (stripleset, template) pairs. Align them by tracking the self.cnt_dirty_data variable when running reader.py.
  • 'discrimina-tive training' spelling errors
  • fix unalignment errors by grep -nriF '<sentence ID="3"/>' '<sentence ID="2"/>'
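The first two Todo items could be handled with a small normalizer along these lines. This is a hypothetical helper, not part of the repo:

```python
import re

def normalize_relation(relation):
    """Collapse a multi-word or punctuated relation into one token.

    Hypothetical sketch: strips punctuation and joins the remaining
    words with underscores.
    """
    words = re.findall(r"\w+", relation)
    return "_".join(words)

print(normalize_relation("was selected by NASA"))  # was_selected_by_NASA
print(normalize_relation("(workedAt,"))            # workedAt
```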
