Wiki-TabNER

This repository contains the dataset and the code for the paper Wiki-TabNER: Integrating NER in Wikipedia tables. A link for downloading the paper can be provided on request.

Motivated by the lack of more complex tables in the datasets commonly used for table interpretation tasks, we propose a new dataset in which named entities are annotated within the table cells.

Example table

Here is an example of how the tables in the Wiki-TabNER dataset are annotated.

The raw table representation in the WikiTables corpus contains information about the linked entities in the cells. Here is how the example table appears in the original corpus.
For each cell, 'text' holds the cell text and 'tdHtmlString' holds the cell HTML, including the links of all linked entities within the cell.
Each entry in 'surfaceLinks' gives the start and end offsets of a linked entity, its title in Wikipedia and its surface text.

'_id': '17499143-1',
'pgTitle': 'Angela Maxwell' ,
'sectionTitle': 'Programs',
'tableCaption': 'Programs',
'tableHeaders': [ {'text': 'Season', ...},
                  {'text': 'Short Program', ...},
                  {'text': 'Free Skating', ...},
                  {'text': 'Exhibition', ...},
                  {'text': '2009-2010', ...}]
'tableData': [[{'text': '2009-2010', 'tdHtmlString':'2009-2010 </th>', 'surfaceLinks': []},
{'text': 'Santa Maria (del Buen Ayre) by Gotan Project Libertango by Ástor Piazzolla',
 'tdHtmlString': '<a href="http://www.wikipedia.org/wiki/La_Revancha_del_Tango" shape="rect">Santa Maria (del Buen Ayre)</a> by
                  <a href="http://www.wikipedia.org/wiki/Gotan_Project" shape="rect">Gotan Project</a>
                  <a href="http://www.wikipedia.org/wiki/Libertango" shape="rect">Libertango</a> by
                  <a href="http://www.wikipedia.org/wiki/Ástor_Piazzolla" shape="rect">Ástor Piazzolla</a> </td>',
 'surfaceLinks': [{'offset': 0, 'endOffset': 1,  'target': {'id': 4683862, 'language': 'en', 'title': 'La_Revancha_del_Tango'}, 'surface': 'Santa Maria (del Buen Ayre)'},
                  {'offset': 31, 'endOffset': 44,  'target': {'id': 1054664, 'language': 'en', 'title': 'Gotan_Project', 'redirecting': False, 'namesapce': 0}, 'surface': 'Gotan Project'},
                  {'offset': 45, 'endOffset': 55,  'target': {'id': 18223717, 'language': 'en', 'title': 'Libertango', 'redirecting': False, 'namesapce': 0}, 'surface': 'Libertango'},
                  {'offset': 15, 'endOffset': 30,  'target': {'id': 44903, 'language': 'en', 'title': 'Ástor_Piazzolla', 'redirecting': False, 'namesapce': 0}, 'surface': 'Ástor Piazzolla'}],
                }
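
The fields described above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the `cell` dictionary is copied from the raw example (with the 'target' entries abbreviated), and the helper names `offsets_match` and `summarize_links` are hypothetical. It lists the surface text and Wikipedia title of every link and checks whether the raw offsets actually slice out the surface text.

```python
# Cell copied from the raw example above; 'target' is abbreviated to the title.
cell = {
    "text": "Santa Maria (del Buen Ayre) by Gotan Project Libertango by Ástor Piazzolla",
    "surfaceLinks": [
        {"offset": 0, "endOffset": 1,
         "target": {"title": "La_Revancha_del_Tango"},
         "surface": "Santa Maria (del Buen Ayre)"},
        {"offset": 31, "endOffset": 44,
         "target": {"title": "Gotan_Project"},
         "surface": "Gotan Project"},
        {"offset": 45, "endOffset": 55,
         "target": {"title": "Libertango"},
         "surface": "Libertango"},
        {"offset": 15, "endOffset": 30,
         "target": {"title": "Ástor_Piazzolla"},
         "surface": "Ástor Piazzolla"},
    ],
}


def offsets_match(cell_text, link):
    """Return True if the raw offsets slice out exactly the surface text."""
    return cell_text[link["offset"]:link["endOffset"]] == link["surface"]


def summarize_links(cell):
    """Return (surface, Wikipedia title, offsets_ok) for every link in a cell."""
    return [
        (link["surface"], link["target"]["title"], offsets_match(cell["text"], link))
        for link in cell["surfaceLinks"]
    ]


for surface, title, ok in summarize_links(cell):
    print(f"{surface!r} -> {title} (raw offsets usable: {ok})")
```

The check assumes that 'endOffset' is exclusive, as in Python slicing.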

Below we show the transformed table in Wiki-TabNER. Note that in the WikiTables corpus, the 'offset' and 'endOffset' positions of the entities are not always correct (for example, for the first and last entities in the cell above). When adding the annotations for the named entities, we therefore look up the position of the entity's 'surface' text in the cell text and use that position as the start and end of the named entity.

['17499143-1',
  'Angela Maxwell',
  'Programs',
  'Programs',
  [[[-1, 0], 'Season'],
   [[-1, 1], 'Short Program'],
   [[-1, 2], 'Free Skating'],
   [[-1, 3], 'Exhibition']],
  [[[0, 0], '2009-2010'],
   [[0, 1], 'Santa Maria (del Buen Ayre) by Gotan Project Libertango by Ástor Piazzolla'],
   [[0, 2], 'Nostradamus by Maksim Mrvica Vampire Knight Guilty from Vampire Knight soundtrack by Haketa Takefumi'],
   [[0, 3], ''],
   ...
   ]
  [[0, 1, 0, 26, 7],
   [0, 1, 31, 44, 2],
   [0, 1, 45, 55, 7],
   [0, 1, 59, 74, 0],
  [[0, 2, 15, 28, 2], [0, 2, 29, 43, 7]]]
  ...
  ]
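
The position lookup described before the transformed table can be sketched as a plain substring search. This is an illustration under assumptions, not the code from the notebooks: the function name `locate_surfaces` is hypothetical, surfaces are searched left to right so that repeated mentions get distinct spans, and the end index is exclusive (Python slicing convention), so individual end positions may differ from the annotations stored in the dataset.

```python
def locate_surfaces(cell_text, surfaces):
    """Find a (start, end) character span for each surface string.

    Surfaces are searched left to right from the end of the previous match;
    if that fails, the whole cell text is searched. A surface that does not
    occur in the cell text yields None.
    """
    spans = []
    cursor = 0
    for surface in surfaces:
        start = cell_text.find(surface, cursor)
        if start == -1:
            # The link order may not follow the text order; retry from the start.
            start = cell_text.find(surface)
        if start == -1:
            spans.append(None)
            continue
        end = start + len(surface)  # exclusive end (assumption)
        spans.append((start, end))
        cursor = max(cursor, end)
    return spans


cell_text = "Santa Maria (del Buen Ayre) by Gotan Project Libertango by Ástor Piazzolla"
surfaces = ["Santa Maria (del Buen Ayre)", "Gotan Project", "Libertango", "Ástor Piazzolla"]
spans = locate_surfaces(cell_text, surfaces)
for surface, span in zip(surfaces, spans):
    print(surface, span)
```

Searching for the surface text makes the spans independent of the raw 'offset' and 'endOffset' values, at the cost of needing a rule for surfaces that occur more than once in a cell.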

Extraction of tables

The tables in the Wiki-TabNER dataset are extracted from the WikiTables corpus. We extracted tables that have, on average, two linked entities per cell. The extraction of the tables is detailed in the 1.Extract_TabNER_tables notebook.
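
As a rough illustration of this kind of filter, the sketch below computes the average number of linked entities per cell for a table in the raw WikiTables format and keeps the table if the average reaches a threshold. The function names, the threshold of 2.0 as a lower bound, and the choice to count every cell (including cells without links) are assumptions; the notebook remains the reference for the actual criterion.

```python
def avg_links_per_cell(table):
    """Average number of 'surfaceLinks' entries over all cells in 'tableData'."""
    cells = [cell for row in table.get("tableData", []) for cell in row]
    if not cells:
        return 0.0
    total_links = sum(len(cell.get("surfaceLinks", [])) for cell in cells)
    return total_links / len(cells)


def keep_table(table, min_avg=2.0):
    """Keep a table whose cells hold at least `min_avg` links on average (assumed rule)."""
    return avg_links_per_cell(table) >= min_avg


# Toy tables with placeholder links, only to show the call pattern.
dense = {"tableData": [[{"surfaceLinks": [{}, {}, {}, {}]}, {"surfaceLinks": []}]]}
sparse = {"tableData": [[{"surfaceLinks": [{}]}, {"surfaceLinks": []}]]}
print(keep_table(dense), keep_table(sparse))
```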

Labeling the entities

We use the surface text of the entities to match them to labeled DBpedia instances. As shown in the example above, we add the labels of the entities at the end of the table structure, where every labeled entity is represented as a list [row index, column index, span start, span end, label].

The labeling of the entities is detailed in the 2.Extract_TabNER_entities notebook.
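
To make the annotation format concrete, here is a small sketch that resolves each [row index, col index, start span, end span, label] entry back to the text it covers. The helper names are hypothetical, the data is taken from the example table, and the end of a span is assumed to be exclusive. The labels are kept as the integer ids stored in the dataset, since the mapping from ids to entity types is not shown in this README.

```python
def cell_lookup(rows):
    """Map (row, col) to the cell text, given [[row, col], text] pairs."""
    return {(row, col): text for (row, col), text in rows}


def annotated_spans(rows, annotations):
    """Return (entity text, label id) for every annotation entry."""
    cells = cell_lookup(rows)
    return [
        (cells[(row, col)][start:end], label)
        for row, col, start, end, label in annotations
    ]


rows = [
    [[0, 0], "2009-2010"],
    [[0, 1], "Santa Maria (del Buen Ayre) by Gotan Project Libertango by Ástor Piazzolla"],
]
annotations = [
    [0, 1, 31, 44, 2],
    [0, 1, 45, 55, 7],
    [0, 1, 59, 74, 0],
]
for text, label in annotated_spans(rows, annotations):
    print(label, text)
```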

We used the following entity types for labeling: Activity, Organisation, Architectural Structure, Event, Place, Person and Work. We provide the labeled dataset and the labeled named entities, linked to their Wikidata IDs, in a separate file which can be downloaded here.

Evaluation of LLMs on NER within tables

We evaluate current LLMs (GPT 3.5, GPT 4 and Llama2) on the task of NER within tables. The evaluation code is in the ner_prompting.py file. To run the evaluation of the OpenAI models, the parameters must be configured first.
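
The scoring itself is defined in ner_prompting.py. Purely as an illustration of how predictions in the [row index, col index, start span, end span, label] format could be compared with the gold annotations, the sketch below computes exact-match precision, recall and F1. The function name `span_prf` and the exact-match criterion are assumptions and are not taken from the repository.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over annotation entries.

    An entry counts as correct only if row, column, span and label all match.
    """
    gold_set = {tuple(entry) for entry in gold}
    pred_set = {tuple(entry) for entry in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [[0, 1, 31, 44, 2], [0, 1, 45, 55, 7], [0, 1, 59, 74, 0]]
predicted = [[0, 1, 31, 44, 2], [0, 1, 45, 55, 0]]  # second label is wrong
precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```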
