IBM/SynthTabNet

SynthTabNet

SynthTabNet is a dataset of 600k PNG images of synthetically generated table layouts, with annotations in JSONL files.

Overview

SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.

Non-synthetic datasets such as PubTabNet, FinTabNet and TableBank have been shown to suffer from several limitations:

  • Their table distributions are skewed towards simpler structures with fewer rows and columns.
  • The variance in appearance styles is very limited.
  • The content is sometimes restricted to certain domains.
  • Bounding boxes are either omitted for empty cells or absent altogether.

SynthTabNet aims to overcome these limitations by providing:

  • A broad range of table sizes and richer combinations of row spans and column spans.
  • A variety of domain-specific appearance styles (e.g. financial data, marketing data, sparse tables).
  • Content generated from the most frequently used terms appearing in non-synthetic datasets (e.g. PubTabNet, FinTabNet).
  • Bounding boxes for all table cells, including the empty ones.
  • Rectangular table structures. For each table, every row has the same number of columns once row spans and column spans are taken into account.
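
The rectangularity property in the last bullet can be checked from a table's structure tokens. The following is a minimal sketch, not the authors' validation code: it assumes the token vocabulary documented under "Data format" (a bare "<td>" for cells without spans, and "<td", span attributes, ">" for cells with spans). The helper names row_widths and is_rectangular are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Matches tokens such as ' colspan="3"' or ' rowspan="2"'.
SPAN_RE = re.compile(r'(rowspan|colspan)="(\d+)"')


def row_widths(structure_tokens: List[str]) -> List[int]:
    """Return the number of grid columns occupied in every row, spans expanded."""
    occupied: Dict[int, Set[int]] = defaultdict(set)  # row index -> occupied columns
    row = -1
    i = 0
    while i < len(structure_tokens):
        tok = structure_tokens[i]
        if tok == "<tr>":
            row += 1
        elif tok in ("<td>", "<td"):
            spans = {"rowspan": 1, "colspan": 1}
            if tok == "<td":
                # Read the span attributes up to the closing ">" token.
                i += 1
                while i < len(structure_tokens) and structure_tokens[i] != ">":
                    match = SPAN_RE.search(structure_tokens[i])
                    if match:
                        spans[match.group(1)] = int(match.group(2))
                    i += 1
            # Place the cell at the first column not taken by a rowspan from above.
            col = 0
            while col in occupied[row]:
                col += 1
            for r in range(row, row + spans["rowspan"]):
                for c in range(col, col + spans["colspan"]):
                    occupied[r].add(c)
        i += 1
    return [len(occupied[r]) for r in range(row + 1)]


def is_rectangular(structure_tokens: List[str]) -> bool:
    """True if all rows occupy the same number of columns."""
    return len(set(row_widths(structure_tokens))) <= 1


example = [
    "<thead>", "<tr>", "<td", ' colspan="2"', ">", "</td>", "<td>", "</td>", "</tr>", "</thead>",
    "<tbody>", "<tr>", "<td", ' rowspan="2"', ">", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>",
    "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>",
]
print(row_widths(example))      # [3, 3, 3]
print(is_rectangular(example))  # True
```

The sketch does not try to detect overlapping spans; it only counts occupied grid positions per row.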

SynthTabNet is organized into 4 parts of 150k tables each (600k in total). Each part contains tables with a distinct appearance in terms of size, structure, style and content. Every part is divided into train, test and val splits (80%, 10%, 10%). The tables are delivered as PNG images and the annotations are in JSONL format.
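
As a quick sanity check, the nominal split sizes follow from the figures above. This is only arithmetic on the stated numbers; it assumes "150k" means exactly 150,000 tables and that the percentages are exact, which the actual files may not match to the last record.

```python
PARTS = ("fintabnet", "marketing", "pubtabnet", "sparse")
TABLES_PER_PART = 150_000  # assumption: "150k" taken literally
SPLIT_PERCENT = {"train": 80, "test": 10, "val": 10}

# Integer arithmetic avoids floating-point rounding.
per_split = {name: TABLES_PER_PART * pct // 100 for name, pct in SPLIT_PERCENT.items()}
total = TABLES_PER_PART * len(PARTS)

print(per_split)  # {'train': 120000, 'test': 15000, 'val': 15000}
print(total)      # 600000
```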

A detailed description of the data synthesis process can be found in the paper.

Download

v2.0.0

Appearance style   Records   Size (GB)   URL v2.0.0
Fintabnet          150k      10          SynthTabNet-part1
Marketing          150k      8           SynthTabNet-part2
PubTabNet          150k      6           SynthTabNet-part3
Sparse             150k      3           SynthTabNet-part4

v2.0.0 MD5 checksums

v2.0.0 SHA1 checksums

v1.0.0

Appearance style   Records   Size (GB)   URL v1.0.0
Fintabnet          150k      10          SynthTabNet-part1
Marketing          150k      8           SynthTabNet-part2
PubTabNet          150k      6           SynthTabNet-part3
Sparse             150k      3           SynthTabNet-part4

v1.0.0 MD5 checksums

v1.0.0 SHA1 checksums
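
A downloaded archive can be verified against the published MD5 or SHA1 checksums with Python's standard hashlib module. The sketch below is illustrative: the function name file_digest and the archive name in the usage note are hypothetical, and the expected digest has to be copied from the checksum files linked above.

```python
import hashlib


def file_digest(path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB archives need not fit in memory."""
    digest = hashlib.new(algorithm)  # accepts "md5" or "sha1", among others
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would look like `file_digest("fintabnet.zip", "sha1") == expected_sha1`, where "fintabnet.zip" is a placeholder for whatever name the downloaded file has and expected_sha1 is the published value.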

Data format

Each part of the dataset corresponds to a top level directory (fintabnet, marketing, pubtabnet, sparse) and has the following structure:

├── images
│   ├── test
│   ├── train
│   └── val
├── synthetic_data.jsonl

The annotations for each part are in the synthetic_data.jsonl file. Each line is a JSON object that corresponds to one PNG image and has the following structure:

"filename": "png image filename inside one of the 'test', 'train', 'val' directories",
"split": "One of 'test', 'train', 'val'",
"html": "Table structure and content",
    "cells": "Array with all table cells",
        "cell_id": "Zero based cell counter",
        "is_header": "true if that cell is part of the table header",
        "span": "In case there is a rowspan / columnspan",
            "spantype": "One of 'rowspan', 'colspan', '2dspan'. The '2dspan' is used in case there is a rowspan and colspan in the same cell",
            "rowspan": "Number of rows spanned by this cell",
            "colspan": "Number of columns spanned by this cell"
        "tokens": "Array with the tokenized content of the cell",
        "bbox": "The bounding box and the class of the cell in [x1, y1, x2, y2, class] format"
    "structure":
        "tokens": "Array with html tags that describe the table structure"
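
Because the annotation file holds one JSON object per line, it can be streamed record by record instead of being loaded whole. The sketch below assumes the directory layout and the "filename" and "split" fields described above; the helper names iter_annotations and image_path are hypothetical and are not taken from the demo notebook.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Optional


def iter_annotations(part_dir, split: Optional[str] = None) -> Iterator[Dict]:
    """Yield annotation records of one dataset part, optionally only one split."""
    jsonl_path = Path(part_dir) / "synthetic_data.jsonl"
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if split is None or record["split"] == split:
                yield record


def image_path(part_dir, record: Dict) -> Path:
    """Locate the PNG of a record: <part>/images/<split>/<filename>."""
    return Path(part_dir) / "images" / record["split"] / record["filename"]
```

For example, `for rec in iter_annotations("fintabnet", split="val")` would walk the validation records of the fintabnet part, assuming the archive was extracted into a directory of that name.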

Note the following about the bbox field:

  • The coordinates origin is the top left corner of the image.
  • Each bbox is described by its top left corner (x1, y1) and bottom right corner (x2, y2).
  • The bbox class can have the values:
    • 1: An empty cell
    • 2: A non-empty cell
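
The bbox conventions above can be captured in a small helper type. This is a sketch under the stated format ([x1, y1, x2, y2, class] with a top-left origin); the names CellBox, parse_bbox and to_xywh are hypothetical, and the conversion to (x, y, width, height) is included only because some tools expect that layout.

```python
from typing import NamedTuple, Sequence, Tuple

EMPTY_CELL = 1      # bbox class for an empty cell
NON_EMPTY_CELL = 2  # bbox class for a non-empty cell


class CellBox(NamedTuple):
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge
    cls: int   # EMPTY_CELL or NON_EMPTY_CELL

    @property
    def is_empty(self) -> bool:
        return self.cls == EMPTY_CELL

    def to_xywh(self) -> Tuple[float, float, float, float]:
        """Convert corner format to (x, y, width, height)."""
        return (self.x1, self.y1, self.x2 - self.x1, self.y2 - self.y1)


def parse_bbox(bbox: Sequence[float]) -> CellBox:
    """Turn a raw [x1, y1, x2, y2, class] list into a CellBox."""
    x1, y1, x2, y2, cls = bbox
    cls = int(cls)
    if cls not in (EMPTY_CELL, NON_EMPTY_CELL):
        raise ValueError(f"unexpected bbox class: {cls}")
    return CellBox(x1, y1, x2, y2, cls)
```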

The structure tokens are drawn from the following vocabulary:

" colspan=\"10\"", " colspan=\"2\"", " colspan=\"3\"", " colspan=\"4\"", " colspan=\"5\"",
" colspan=\"6\"", " colspan=\"7\"", " colspan=\"8\"", " colspan=\"9\"", " rowspan=\"10\"",
" rowspan=\"2\"", " rowspan=\"3\"", " rowspan=\"4\"", " rowspan=\"5\"", " rowspan=\"6\"",
" rowspan=\"7\"", " rowspan=\"8\"", " rowspan=\"9\"", "</tbody>", "</td>", "</thead>",
"</tr>", "<end>", "<pad>", "<start>", "<tbody>", "<td", "<td>", "<thead>", "<tr>", "<unk>", ">"
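
The presence of "<start>", "<end>", "<pad>" and "<unk>" suggests the list doubles as a vocabulary for sequence models. The sketch below shows one way to map a structure token sequence to fixed-length integer ids. It is an assumption-laden illustration: the ids come from sorting the vocabulary and are not the ids used by the authors, and the encode helper is hypothetical.

```python
from typing import Dict, List

# Span attribute tokens: colspan/rowspan values 2 through 10, as in the list above.
SPAN_TOKENS = [f' {kind}="{n}"' for kind in ("colspan", "rowspan") for n in range(2, 11)]
TAG_TOKENS = [
    "</tbody>", "</td>", "</thead>", "</tr>", "<end>", "<pad>", "<start>",
    "<tbody>", "<td", "<td>", "<thead>", "<tr>", "<unk>", ">",
]
STRUCTURE_VOCAB: List[str] = sorted(SPAN_TOKENS + TAG_TOKENS)
TOKEN_TO_ID: Dict[str, int] = {tok: i for i, tok in enumerate(STRUCTURE_VOCAB)}


def encode(tokens: List[str], max_len: int) -> List[int]:
    """Wrap tokens in <start>/<end>, map unknown tokens to <unk>, pad with <pad>."""
    ids = [TOKEN_TO_ID["<start>"]]
    ids += [TOKEN_TO_ID.get(tok, TOKEN_TO_ID["<unk>"]) for tok in tokens]
    ids.append(TOKEN_TO_ID["<end>"])
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return ids + [TOKEN_TO_ID["<pad>"]] * (max_len - len(ids))
```

A call such as `encode(["<tr>", "<td>", "</td>", "</tr>"], max_len=8)` returns eight ids: the start id, four tag ids, the end id and two padding ids.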

Example data

[Example table images: pubtabnet, sparse, fintabnet, marketing]

Jupyter notebook

The following Jupyter notebook demonstrates how to download and use the dataset:

Demo Notebook

Paper

"TableFormer: Table Structure Understanding with Transformers" (CVPR 2022).

ArXiv link: https://arxiv.org/abs/2203.01017

Citation:

@article{nassar2022tableformer,
  title={TableFormer: Table Structure Understanding with Transformers},
  author={Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},
  journal={arXiv preprint arXiv:2203.01017},
  year={2022}
}