The pre-training data consists of 6.2 million table-text examples extracted from English Wikipedia in December 2019. The text associated with a table comprises the page title and description, the table caption, and the section title and section text.
This is an example in proto text format extracted from this page.
table: {
  columns: { text: "Year" }
  columns: { text: "Film" }
  columns: { text: "Dialogue-writer(s)" }
  rows: {
    cells: { text: "2013\n(1st)" }
    cells: { text: "" }
    cells: { text: "" }
  }
  rows: {
    cells: { text: "2013\n(1st)" }
    cells: { text: "Main Hoon Shahid Afridi" }
    cells: { text: "Vasay Chaudhry" }
  }
  table_id: "http://en.wikipedia.org/wiki/ARY_Film_Award_for_Best_Dialogue_1"
}
questions: {
  id: "TITLE"
  original_text: "ARY Film Award for Best Dialogue"
}
questions: {
  id: "DESCRIPTION"
  original_text: "The ARY Film Award for Best Dialogue is the ARY Film Award for the best dialogues of the year in film. It is one of three writing awards in the Technical Awarding category."
}
questions: {
  id: "SEGMENT_TITLE"
  original_text: "2010s"
}
You can find the latest version of the data here. We also provide a small snapshot of the first 100 interactions.
create_pretrain_examples_main.py converts the data to TF examples.
It can be run locally (which will take a long time on a single machine) or as a Dataflow job on Google Cloud.
You can find command line snippets here.
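If you want to sanity-check the conversion output, the following is a minimal sketch, assuming the script writes TFRecord files of serialized tf.train.Example protos; the file name output/train.tfrecord is a placeholder for whatever output path you configured.

import tensorflow as tf

# Placeholder path; point this at a file produced by create_pretrain_examples_main.py
# (assumed here to be a TFRecord of serialized tf.train.Example protos).
dataset = tf.data.TFRecordDataset("output/train.tfrecord")
for serialized in dataset.take(1):
  example = tf.train.Example.FromString(serialized.numpy())
  print(sorted(example.features.feature.keys()))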
In case you want to work with the data in ways we didn't anticipate, you can simply parse the files into proto objects line by line.
Here is a simple example:
from google.protobuf import text_format
from tapas.protos import interaction_pb2

# Placeholder path; each line of the (uncompressed) data file holds one Interaction in text format.
with open("interactions.txtpb") as input_file:
  for line in input_file:
    interaction = text_format.Parse(line, interaction_pb2.Interaction())
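Each parsed interaction then exposes the table and the associated text directly; as a rough sketch, with field names taken from the proto text example above:

# `interaction` is one parsed Interaction from the loop above.
print(interaction.table.table_id)
for column in interaction.table.columns:
  print(column.text)
for row in interaction.table.rows:
  print([cell.text for cell in row.cells])
for question in interaction.questions:
  # Ids such as "TITLE", "DESCRIPTION" and "SEGMENT_TITLE" carry the associated page text.
  print(question.id, question.original_text)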
This data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
See also the Wikipedia Copyrights page.
You can cite the ACL 2020 paper.