Merge pull request #3 from tabbydoc/dev
A prototype of TabbyLD2 (version based on separate methods and result dicts)
LedZeppe1in authored Jan 30, 2022
2 parents 0beef33 + 5d35508 commit 2eabc86
Showing 17 changed files with 1,628 additions and 605 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -138,4 +138,7 @@ dmypy.json
cython_debug/

# idea
.idea/
.idea/

# processing result files
results/
86 changes: 84 additions & 2 deletions README.md
@@ -1,3 +1,85 @@
# TabbyLD version 2.0
# TabbyLD2

A web-based application to annotate web tables and generate knowledge graphs
A web-based application to annotate relational tables and generate knowledge graphs.

## Version

0.1

## Preliminaries

A source (input) table represents a set of entities of the same type in relational form (a subset of the Cartesian product of *K* data domains), where:
1. *Attribute (a column name)* is the name of a data domain in a relation schema;
2. *Metadata (a schema)* is an ordered set of *K* attributes of a relational table;
3. *Tuple (a record)* is an ordered set of *K* atomic values (one for each attribute of a relation);
4. *Data (a recordset)* is the set of tuples of a relational table.

A table of same-type entities (*a canonicalized form*) is a relational table in third normal form (3NF) that contains an ordered set of *N* rows and *M* columns.

A table represents a set of entities of the same type, where:
1. *Categorical column or named-entity column (NE-column)* contains names (text mentions) of named entities;
2. *Literal column (L-column)* contains literal values (e.g. dates, numbers);
3. *Subject (thematic) column (S-column)* is an *NE*-column that serves as a potential primary key and defines the subject of a source table;
4. *Other (non-subject) columns* represent entity properties, including relationships with other entities.
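
The column roles above can be illustrated with a minimal sketch. It is not TabbyLD2's actual classifier: the rule "a cell is literal if it parses as a number", the majority vote, and the "leftmost NE-column is the S-column" heuristic are all assumptions made for illustration, and the function names are hypothetical. The sample data is `source_tables/test.csv` from this commit.

```python
import csv
import io

# Sample copied from source_tables/test.csv in this commit.
SAMPLE = """album, year, US_peak_chart_post
The White Stripes, 1999, 61
De Stijl, 2000, 4
White Blood Cells, 2001, -
"""


def is_literal(value: str) -> bool:
    """Assumed rule: a cell is literal if it parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify_columns(text: str) -> dict[str, str]:
    """Label each column 'L' (literal) or 'NE' (named entity) by majority vote."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header, data = rows[0], rows[1:]  # the first row is treated as the header
    labels = {}
    for index, name in enumerate(header):
        cells = [row[index].strip() for row in data]
        literal_count = sum(is_literal(cell) for cell in cells)
        labels[name.strip()] = "L" if literal_count * 2 > len(cells) else "NE"
    return labels


def pick_subject_column(labels: dict[str, str]):
    """Assumed heuristic: the leftmost NE-column is the S-column candidate."""
    for name, label in labels.items():
        if label == "NE":
            return name
    return None


labels = classify_columns(SAMPLE)
print(labels)
print(pick_subject_column(labels))
```

On this sample, `album` is labeled as an NE-column and chosen as the S-column candidate, while `year` and `US_peak_chart_post` are labeled as L-columns (the `-` cell is outvoted by the two numeric cells).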

**Assumption 1.** *The first row of a source table is a header containing attribute (column) names.*

**Assumption 2.** *All cell values within a column of a source table share the same entity type and data type.*

**Assumption 3.** *TabbyLD2 supports semantic interpretation (annotation) of individual elements of a source table by using a target knowledge graph. [DBpedia](https://www.dbpedia.org/) is used as the target knowledge graph.*

#### Semantic table interpretation
*Semantic Table Interpretation (STI)* is the process of recognizing tabular data and linking it with external concepts from a target knowledge graph. It includes three main tasks:
1. *Cell-Entity Annotation (CEA)* matches values of table cells to entities (specific instances) from a target knowledge graph;
2. *Column-Type Annotation (CTA)* matches columns (or headers, if available) to classes or datatypes from a target knowledge graph;
3. *Columns Property Annotation (CPA)* matches the relationship between two columns (the S-column and any other column) to a property (relationship) from a target knowledge graph.
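
As a rough illustration of what the three tasks produce, the sketch below stores their results in plain dictionaries. The `TableAnnotation` class and its field names are hypothetical and do not reflect TabbyLD2's own result format. The identifiers are hand-written, unverified examples for `source_tables/test.csv`, not output of the tool.

```python
from dataclasses import dataclass, field


@dataclass
class TableAnnotation:
    """Hypothetical container for STI results (not TabbyLD2's own schema)."""

    # CEA: (zero-based data row index, column index) -> entity identifier
    cea: dict[tuple[int, int], str] = field(default_factory=dict)
    # CTA: column index -> class or datatype identifier
    cta: dict[int, str] = field(default_factory=dict)
    # CPA: (subject column index, other column index) -> property identifier
    cpa: dict[tuple[int, int], str] = field(default_factory=dict)


# Hand-written, unverified example values; a real run may differ.
annotation = TableAnnotation()
annotation.cea[(1, 0)] = "dbr:De_Stijl_(album)"  # cell "De Stijl" in the album column
annotation.cta[0] = "dbo:Album"                  # the album column holds albums
annotation.cta[1] = "xsd:gYear"                  # the year column holds years
annotation.cpa[(0, 1)] = "dbo:releaseDate"       # relationship between album and year

print(annotation)
```

Keying CEA by cell, CTA by column, and CPA by column pair mirrors the granularity of each task as defined above.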

## Installation

First, clone the project into a local directory:

```
git clone https://github.com/tabbydoc/tabbyld2.git
```

Next, install the project requirements:

```
pip install -r requirements.txt
```

*We recommend using Python 3.0 or later.*

#### Additional software

In addition to [SPARQL](https://www.w3.org/TR/rdf-sparql-query/) queries, we use [DBpedia Lookup](https://github.com/dbpedia/dbpedia-lookup) to find candidate entities from DBpedia. This service requires a separate installation.
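
To give an idea of what a candidate-entity SPARQL query might look like, here is a minimal sketch that only builds the query text. The function name `build_candidate_query` and the exact-match strategy on `rdfs:label` are assumptions for illustration; the queries TabbyLD2 actually issues may differ.

```python
def build_candidate_query(mention: str, limit: int = 10) -> str:
    """Build a SPARQL query that finds entities whose English label equals the mention.

    Hypothetical helper: exact label matching is an assumed, simplified strategy.
    """
    # Escape backslashes and double quotes so the mention stays a valid string literal.
    escaped = mention.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@en .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


print(build_candidate_query("De Stijl"))
```

The resulting text could then be sent to a SPARQL endpoint, for example with the `sparqlwrapper` package listed in `requirements.txt`.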

## Usage

#### Console mode

To use TabbyLD2 in *console mode*, run the following command:

```
python main.py
```

This script processes source tables in CSV format. Tables must be located in the `source_tables` directory.

The processing results are produced in JSON format and saved to the `results` directory (`json` and `provenance` subdirectories).
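
The input and output layout described here can be sketched as follows. This is not the logic of `main.py`: the function `export_tables` is hypothetical, it performs no annotation, and it simply converts each CSV table into a JSON list of rows to show how the `source_tables` and `results/json` directories relate.

```python
import csv
import json
from pathlib import Path


def export_tables(source_dir: Path, results_dir: Path) -> list[Path]:
    """Write each CSV in source_dir as a JSON list of row dicts under results_dir/json."""
    json_dir = results_dir / "json"
    json_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(source_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # skipinitialspace drops the blank after each comma, as in the sample tables
            reader = csv.DictReader(handle, skipinitialspace=True)
            rows = [dict(row) for row in reader]
        out_path = json_dir / f"{csv_path.stem}.json"
        out_path.write_text(
            json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

Calling `export_tables(Path("source_tables"), Path("results"))` from the project root would write one JSON file per source table, for example `results/json/test.json`.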

#### Web mode

To use TabbyLD2 in *web mode*, run the following command:

```
python app.py
```

**NOTE:** *This mode does not work at the moment!*

## Authors

* [Nikita O. Dorodnykh](mailto:tualatin32@mail.ru)
* [Daria A. Denisova](mailto:daryalich@mail.ru)
* [Vitaliy V. Biryuckov](mailto:stukov.biryuckov2017@yandex.ru)
22 changes: 15 additions & 7 deletions requirements.txt
@@ -1,7 +1,15 @@
ftfy~=5.8
flask~=1.1.2
pandas~=1.2.3
stanza~=1.2.0
werkzeug~=1.0.1
requests~=2.25.1
gql~=3.0.0a5
# Main packages
ftfy
flask
pandas
stanza
werkzeug
requests
sparqlwrapper
python-Levenshtein
# Packages for entity embeddings
gensim
pyrdf2vec
nest_asyncio
aiohttp
attr
21 changes: 21 additions & 0 deletions source_tables/media.csv
@@ -0,0 +1,21 @@
#, Media, MIX
1, Dainik Jagran, 15.400
2, Dainik Bhaskar, 14.000
3, CNN Editions (International), 14.000
4, CNN, 12.000
5, NDTV, 10.000
6, Times of India, 4.800
7, Globo, 4.500
8, Dailymail, 4.500
9, Malayala Manorama, 4.000
10, Dinamalar, 3.500
11, WALL STREET JOURNAL USA, 3.400
12, foxnews, 3.300
13, New York Times, 3.250
14, Gujarat Samachar, 3.000
15, Telecinco, 2.800
16, IBN live, 2.800
17, USA Today, 2.525
18, The Sun, 2.500
19, Joong Ang Ilbo, 2.250
20, AARP Bulletin, 2.160
4 changes: 4 additions & 0 deletions source_tables/test.csv
@@ -0,0 +1,4 @@
album, year, US_peak_chart_post
The White Stripes, 1999, 61
De Stijl, 2000, 4
White Blood Cells, 2001, -