Merge pull request #3 from tabbydoc/dev

A prototype of TabbyLD2 (version based on separate methods and result dicts)
tabbydoc · Jan 30, 2022 · 2eabc86
2 parents 0beef33 + 5d35508
commit 2eabc86
Showing 17 changed files with 1,628 additions and 605 deletions.
diff --git a/.gitignore b/.gitignore
@@ -138,4 +138,7 @@ dmypy.json
 cython_debug/
 
 # idea
-.idea/
+.idea/
+
+# processing result files
+results/
diff --git a/README.md b/README.md
@@ -1,3 +1,85 @@
-# TabbyLD version 2.0
+# TabbyLD2
 
-A web-based application to annotate web tables and generate knowledge graphs
+A web-based application to annotate relational tables and generate knowledge graphs.
+
+## Version
+
+0.1
+
+## Preliminaries
+
+A source (input) table represents a set of entities of the same type in relational form (a subset of the Cartesian product of *K* data domains), where:
+1.	*Attribute (a column name)* is the name of a data domain in a relation schema;
+2.	*Metadata (a schema)* is an ordered set of *K* attributes of a relational table;
+3.	*Tuple (a record)* is an ordered set of *K* atomic values (one for each attribute of a relation);
+4.	*Data (recordset)* is the set of tuples of a relational table.
+
+A table of entities of the same type (*a canonicalized form*) is a relational table in third normal form (3NF) that contains an ordered set of *N* rows and *M* columns.
+
+A table represents a set of entities of the same type, where:
+1.	*Categorical column or Named entities column (NE-column)* contains names (text mentions) of some named entities;
+2.	*Literal column (L-column)* contains literal values (e.g. dates, numbers);
+3.	*Subject (thematic) column (S-column)* is an *NE*-column that serves as a potential primary key and defines the subject of a source table;
+4.	*Other (non-subject) columns* represent entity properties, including their relationships with other entities.
+
+**Assumption 1.** *The first row of a source table is a header containing attribute (column) names.*
+
+**Assumption 2.** *All cell values in a column of a source table share the same entity type and data type.*
+
+**Assumption 3.** *TabbyLD2 supports semantic interpretation (annotation) of separate elements of a source table by using a target knowledge graph. [DBpedia](https://www.dbpedia.org/) is used as the target knowledge graph.*
+
+#### Semantic table interpretation
+*Semantic Table Interpretation (STI)* is the process of recognizing and linking tabular data with external concepts from a target knowledge graph, which includes three main tasks:
+1.	*Cell-Entity Annotation (CEA)* is a matching between values of table cells and entities (specific instances) from a target knowledge graph;
+2.	*Column-Type Annotation (CTA)* is a matching between columns (or headers, if available) and classes or datatypes from a target knowledge graph;
+3.	*Column-Property Annotation (CPA)* is a matching between the relationship of two columns (the S-column and each other column) and properties (relationships) from a target knowledge graph.
+
+## Installation
+
+First, clone the project into a local directory:
+
+```
+git clone https://github.com/tabbydoc/tabbyld2.git
+```
+
+Next, install the project requirements:
+
+```
+pip install -r requirements.txt
+```
+
+*We recommend using Python 3.0 or later.*
+
+#### Additional software
+
+In addition to [SPARQL](https://www.w3.org/TR/rdf-sparql-query/) queries, we use [DBpedia Lookup](https://github.com/dbpedia/dbpedia-lookup) to find candidate entities from DBpedia. This service requires a separate installation.
+
+## Usage
+
+#### Console mode
+
+To use TabbyLD2 in *console mode*, run the following command:
+
+```
+python main.py
+```
+
+Run this script to process source tables in CSV format. Tables must be located in the `source_tables` directory.
+
+The processing results are presented in JSON format and saved to the `results` directory (`json` and `provenance` subdirectories).
+
+#### Web mode
+
+To use TabbyLD2 in *web mode*, run the following command:
+
+```
+python app.py
+```
+
+**NOTE:** *This mode does not work at the moment!*
+
+## Authors
+
+* [Nikita O. Dorodnykh](mailto:tualatin32@mail.ru)
+* [Daria A. Denisova](mailto:daryalich@mail.ru)
+* [Vitaliy V. Biryuckov](mailto:stukov.biryuckov2017@yandex.ru)
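
The column roles defined in the README's Preliminaries section (NE-column, L-column, S-column) can be illustrated with a small heuristic. This is only a sketch and not TabbyLD2's actual method: the function names `classify_columns` and `subject_column`, the numeric regular expression, the majority threshold, and the "leftmost NE-column with unique values" rule are all assumptions made for illustration. The sample rows are taken from `source_tables/media.csv` in this commit.

```python
import re

# Assumption: a cell counts as a literal if it looks like a plain number.
NUMERIC = re.compile(r"^[+-]?\d+([.,]\d+)?$")


def classify_columns(rows, threshold=0.5):
    """Label each column 'L' (literal) or 'NE' (named entity) by majority vote."""
    labels = []
    for column in zip(*rows):
        cells = [c for c in column if c]  # ignore empty cells
        literal = sum(1 for c in cells if NUMERIC.match(c))
        is_literal = bool(cells) and literal / len(cells) > threshold
        labels.append("L" if is_literal else "NE")
    return labels


def subject_column(rows, labels):
    """Return the index of the leftmost NE-column whose values are all unique.

    Uniqueness stands in for the "potential primary key" property of an
    S-column; returns None if no column qualifies.
    """
    for index, label in enumerate(labels):
        values = [row[index] for row in rows]
        if label == "NE" and len(set(values)) == len(values):
            return index
    return None


# Data rows (header excluded) from source_tables/media.csv
rows = [
    ["1", "Dainik Jagran", "15.400"],
    ["2", "Dainik Bhaskar", "14.000"],
    ["4", "CNN", "12.000"],
]
labels = classify_columns(rows)
s_index = subject_column(rows, labels)
```

For these rows the `Media` column is the only NE-column, so it is also chosen as the subject column. A real pipeline would likely need richer literal detection (dates, units) than this regular expression.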
diff --git a/requirements.txt b/requirements.txt
@@ -1,7 +1,15 @@
-ftfy~=5.8
-flask~=1.1.2
-pandas~=1.2.3
-stanza~=1.2.0
-werkzeug~=1.0.1
-requests~=2.25.1
-gql~=3.0.0a5
+# Main packages
+ftfy
+flask
+pandas
+stanza
+werkzeug
+requests
+sparqlwrapper
+python-Levenshtein
+# Packages for entity embeddings
+gensim
+pyrdf2vec
+nest_asyncio
+aiohttp
+attr
diff --git a/source_tables/media.csv b/source_tables/media.csv
@@ -0,0 +1,21 @@
+#, Media, MIX
+1, Dainik Jagran, 15.400
+2, Dainik Bhaskar, 14.000
+3, CNN Editions (International), 14.000
+4, CNN, 12.000
+5, NDTV, 10.000
+6, Times of India, 4.800
+7, Globo, 4.500
+8, Dailymail, 4.500
+9, Malayala Manorama, 4.000
+10, Dinamalar, 3.500
+11, WALL STREET JOURNAL USA, 3.400
+12, foxnews, 3.300
+13, New York Times, 3.250
+14, Gujarat Samachar, 3.000
+15, Telecinco, 2.800
+16, IBN live, 2.800
+17, USA Today, 2.525
+18, The Sun, 2.500
+19, Joong Ang Ilbo, 2.250
+20, AARP Bulletin, 2.160
diff --git a/source_tables/test.csv b/source_tables/test.csv
@@ -0,0 +1,4 @@
+    album, year, US_peak_chart_post
+    The White Stripes, 1999, 61
+    De Stijl, 2000, 4
+    White Blood Cells, 2001, -
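
Both sample tables pad their cells with spaces after the commas, and `test.csv` uses `-` for a missing value. A minimal sketch of loading such a file with the standard `csv` module is shown below, assuming (as the README's Assumption 1 states) that the first row is the header. The helper name `load_table` and the choice to map `-` to `None` are illustrative assumptions, not the behavior of `main.py`.

```python
import csv
import io

# Contents of source_tables/test.csv as added in this commit
SAMPLE = """\
    album, year, US_peak_chart_post
    The White Stripes, 1999, 61
    De Stijl, 2000, 4
    White Blood Cells, 2001, -
"""


def load_table(text, missing="-"):
    """Parse CSV text into (header, rows).

    Cells are stripped of surrounding whitespace, and any cell equal to
    `missing` is replaced with None.
    """
    reader = csv.reader(io.StringIO(text))
    table = [[cell.strip() for cell in row] for row in reader if row]
    header, body = table[0], table[1:]
    body = [[None if cell == missing else cell for cell in row] for row in body]
    return header, body


header, rows = load_table(SAMPLE)
```

Keeping every value as a string at this stage leaves the decision about data types to a later column-classification step.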