
Data Preparation


Taheryan 2016 Datasets

WARNING: this step has already been performed, so all data is available in the repository.

To prepare the data, you can follow the instructions below.

Source Generation in JSON format

Currently, SeMi is able to process only JSON files, because it adopts the JARQL engine for the serialization of the semantic model. For this reason, the CSV and XML files included in the Taheriyan data need to be converted to JSON format. To perform this conversion, you can run:

$ node preparation/taheriyan2016/convert_source_in_json.js task_01
$ node preparation/taheriyan2016/convert_source_in_json.js task_02
$ node preparation/taheriyan2016/convert_source_in_json.js task_03
$ node preparation/taheriyan2016/convert_source_in_json.js task_04

The output of this step is stored in the data/taheriyan2016/{task}/sources/original_json folder.
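For intuition, here is a minimal standalone sketch of such a conversion, assuming a simple comma-separated file with a header row and no quoted fields; it is illustrative only and not the repository's convert_source_in_json.js:

```javascript
// Minimal CSV-to-JSON conversion (illustrative only; not the repository's
// convert_source_in_json.js). Assumes a simple comma-separated file with a
// header row and no quoted fields.
const fs = require('fs');

function csvToJson(csvPath) {
  const [header, ...rows] = fs.readFileSync(csvPath, 'utf8').trim().split('\n');
  const keys = header.split(',');
  // Turn each row into an object keyed by the header columns.
  return rows.map(row => {
    const values = row.split(',');
    return Object.fromEntries(keys.map((k, i) => [k, values[i]]));
  });
}

// Hypothetical usage: write the JSON next to the original source.
fs.writeFileSync('data/example.json',
  JSON.stringify(csvToJson('data/example.csv'), null, 2));
```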

Automatic Semantic Types

The ground truth included in Taheriyan's original data can be exploited to automatically create an initial version of the semantic types. To create these semantic types, you can run:

$ node preparation/taheriyan2016/create_gt_st.js task_01
$ node preparation/taheriyan2016/create_gt_st.js task_02
$ node preparation/taheriyan2016/create_gt_st.js task_03
$ node preparation/taheriyan2016/create_gt_st.js task_04

The output of this step is stored in the data/taheriyan2016/{task}/semantic_types/auto folder.
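For context, a semantic type essentially annotates a source attribute with an ontology class/property pair. The snippet below shows a purely illustrative shape with invented field names and URIs; it is not the exact schema written to the semantic_types/auto folder:

```json
{
  "attribute": "artist",
  "semantic_type": {
    "class": "http://example.org/ontology/Person",
    "property": "http://example.org/ontology/name"
  }
}
```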

Cleaning and Enriching

Input Sources Analysis

The extraction of semantic types from the ground truth data shows that some attributes (columnNodes, in Taheriyan's terminology) are added to the semantic types to trigger specific parsing operations.

For instance, if a source attribute is called "artist" and we want to generate a URI from this field, a new node called "artistURI" is added among the semantic type nodes. As a consequence, to maintain coherence between semantic types and source attributes, we need to enrich the source data with an attribute corresponding to the "artistURI" semantic type.
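A minimal sketch of this enrichment, assuming a hypothetical URI prefix and slugging rule (not SeMi's actual cleaning logic):

```javascript
// Illustrative enrichment: derive an "artistURI" field from "artist" so the
// source attributes stay coherent with the semantic type nodes. The URI
// prefix and slugging rule are assumptions, not SeMi's actual logic.
function enrichWithUri(records) {
  return records.map(record => ({
    ...record,
    artistURI: 'http://example.org/artist/' +
      encodeURIComponent(record.artist.trim().toLowerCase()),
  }));
}

console.log(enrichWithUri([{ artist: 'Claude Monet' }]));
// [{ artist: 'Claude Monet', artistURI: 'http://example.org/artist/claude%20monet' }]
```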

Other issues are related to the JSON conversion step. In particular, when converting from XML files, redundant fields are added to the resulting JSON files.

Semantic Types Analysis

As reported in the previous section, special keywords have been introduced within the semantic types to trigger specific parsing operations. For instance, Taheriyan injected the keyword value within the JSON files to enable the parsing of arrays of objects.

For this reason, semantic types generated automatically require minor updates.
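As a plausible illustration (the exact shapes in the Taheriyan files may differ), an array of primitive values would be rewritten so that each item can be addressed through a named value field:

```
Before: { "artists": ["Monet", "Degas"] }
After:  { "artists": [{ "value": "Monet" }, { "value": "Degas" }] }
```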

Scripts for Cleaning

To address both issues, we apply some changes to prepare and clean the data for SeMi ingestion and processing. To run the cleaning script, you can execute:

$ node preparation/taheriyan2016/clean.js task_01
$ node preparation/taheriyan2016/clean.js task_02
$ node preparation/taheriyan2016/clean.js task_03
$ node preparation/taheriyan2016/clean.js task_04

Updated sources are available in the data/taheriyan2016/{task}/sources/updated_json folder. Updated semantic types are available in the data/taheriyan2016/{task}/semantic_types/updated folder.

WARNING: the generated JSON file s21-s-met.json is broken, so it has been removed from the evaluation. As a consequence, the related semantic type file s21-s-met_st.json has been removed too.

WARNING: the generated input JSON file data/taheriyan2016/task_04/sources/updated_json/tennesseegunexchange.json has been updated due to a bug in the generation of the refined semantic model. The change is available in the following commit: https://github.com/giuseppefutia/semi/commit/70441d4e29086c62f8307f0064908a5c15ab41b5. Since this turned out to be a recurrent bug, the other sources of task_04 have been manually updated as well: https://github.com/giuseppefutia/semi/commit/c18c18f6949301171b681b0956711c0a4790165b.

SeMi Semantic Models (Initial)

After the data cleaning step, it is possible to create the initial semantic models using SeMi. This task is based on the generation of a weighted multi-edge graph built on the ontology structure. A Steiner tree algorithm is then run on this graph to create the initial semantic model. You can run the following scripts:

$ node preparation/taheriyan2016/create_semi_sm.js task_01
$ node preparation/taheriyan2016/create_semi_sm.js task_02
$ node preparation/taheriyan2016/create_semi_sm.js task_03
$ node preparation/taheriyan2016/create_semi_sm.js task_04
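For intuition about the Steiner step, here is a self-contained sketch of the classic shortest-path heuristic: grow the tree from one terminal and repeatedly attach the closest remaining terminal via its shortest path. This is a simplified stand-in for illustration, not SeMi's actual implementation; it assumes a connected graph and ignores multi-edges:

```javascript
// Illustrative shortest-path Steiner heuristic (not SeMi's implementation).
// graph: { node: [{ to, weight }, ...] }; terminals: nodes to be connected.
// Assumes all terminals are reachable from each other.
function dijkstra(graph, sources) {
  const dist = {};
  const prev = {};
  const queue = [...sources];
  for (const s of sources) dist[s] = 0;
  while (queue.length) {
    // Pop the closest queued node (a heap would be faster; fine for a sketch).
    queue.sort((a, b) => dist[a] - dist[b]);
    const u = queue.shift();
    for (const { to, weight } of graph[u] || []) {
      const d = dist[u] + weight;
      if (!(to in dist) || d < dist[to]) {
        dist[to] = d;
        prev[to] = u;
        queue.push(to);
      }
    }
  }
  return { dist, prev };
}

function steinerTree(graph, terminals) {
  const treeNodes = new Set([terminals[0]]);
  const treeEdges = [];
  const remaining = new Set(terminals.slice(1));
  while (remaining.size) {
    // Distances from the current tree to every other node.
    const { dist, prev } = dijkstra(graph, [...treeNodes]);
    // Attach the terminal closest to the tree by splicing in its path.
    const next = [...remaining].reduce((a, b) => (dist[a] <= dist[b] ? a : b));
    for (let v = next; !treeNodes.has(v) && prev[v] !== undefined; v = prev[v]) {
      treeEdges.push([prev[v], v]);
      treeNodes.add(v);
    }
    remaining.delete(next);
  }
  return treeEdges;
}
```

In SeMi's setting, the terminals would correspond to the nodes representing the semantic types of the source attributes, and the selected edges to the ontology relations forming the initial semantic model.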

The graph generated by this step is serialized in three different formats:

  • .json
  • .dot
  • .jarql

Details of these serializations can be explored in the examples available here.

The outputs of this step are available in the data/taheriyan2016/{task}/semantic_models/ folder and in the evaluation/taheriyan2016/{task}/semantic_models_steiner/ folder.

Ground Truth Semantic Models

Now it is possible to serialize the ground truth semantic models in the SeMi format (JARQL) and, from these models, generate the RDF files.

This step has a twofold contribution:

  1. enabling the comparison of the semantic models created automatically by SeMi against the ground truth semantic models;
  2. creating the KGs used to train and validate the deep learning model adopted for link prediction.

Ground Truth Serialization into SeMi format (JARQL)

To serialize the ground truth semantic models in the SeMi format (JARQL), you can run the following command:

$ node preparation/taheriyan2016/create_gt_sm.js task_01
$ node preparation/taheriyan2016/create_gt_sm.js task_02
$ node preparation/taheriyan2016/create_gt_sm.js task_03
$ node preparation/taheriyan2016/create_gt_sm.js task_04

The JARQL version of the ground truth semantic models is available in the evaluation/taheriyan2016/{task}/semantic_models_gt/jarql folder. This representation is directly exploited for the evaluation. This step also creates a JARQL serialization of the semantic types, stored in the evaluation/taheriyan2016/{task}/semantic_models_gt/jarql_st folder. These semantic types will be integrated into the background knowledge for training purposes.
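For readers unfamiliar with JARQL, a serialized semantic model is essentially a SPARQL CONSTRUCT query executed against the JSON source. The sketch below is illustrative only: the jarql: prefix illustrates the engine's convention of exposing JSON fields as RDF properties, while the ontology terms, field names, and URI pattern are invented for the example:

```sparql
PREFIX jarql: <http://jarql.com/>
PREFIX ex:    <http://example.org/ontology/>

# JSON fields are exposed by the engine as jarql: properties;
# the CONSTRUCT template maps them to ontology triples.
CONSTRUCT {
  ?artist a ex:Person ;
          ex:name ?name .
}
WHERE {
  ?record jarql:artist ?name .
  BIND (IRI(CONCAT("http://example.org/artist/", ENCODE_FOR_URI(?name))) AS ?artist)
}
```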

Background Knowledge

WARNING: the current implementation of JARQL has problems with nested arrays. For this reason, we need to apply some manual fixes to the s19-s-indianapolis-artworks.query file in task_03 in order to create the background knowledge. These fixes are available in the following commit: https://github.com/giuseppefutia/semi/commit/e0c1def0e5c465d1399f7be21d9609d5d058f915?diff=unified. They do not impact the evaluation, but only the quality of the background knowledge used for training.

WARNING: we encountered issues in the generation of the background knowledge due to the URI generation (binding process) within the file s08-s-17-edited. Therefore, we applied some fixes, available in the following commit: https://github.com/giuseppefutia/semi/commit/e3da5896d0f2f9bd5a8a384fd2e97db4bc9cc235.

In order to train the link prediction model, we have to produce the background knowledge. For task_01, task_03, and task_04, we adopted the leave-one-out setting: given a target source for semantic modeling, the background knowledge is built from the other sources included in the dataset. As mentioned above, the background knowledge related to a target source also includes the target's semantic types.
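A compact sketch of the leave-one-out assembly, under the assumption that each source has one RDF file plus an RDF file derived from its semantic types (directory layout and file names here are hypothetical):

```javascript
// Illustrative leave-one-out assembly (not the repository script). For a
// target source, concatenate the RDF of every *other* source plus the RDF
// generated from the target's own semantic types. Directory layout and the
// .rdf extension are assumptions for the sake of the example.
const fs = require('fs');
const path = require('path');

function buildBackground(rdfDir, rdfStDir, targetFile) {
  const others = fs.readdirSync(rdfDir)
    .filter(f => f.endsWith('.rdf') && f !== targetFile)
    .map(f => fs.readFileSync(path.join(rdfDir, f), 'utf8'));
  // The target source itself is left out; only its semantic types go in.
  const targetSt = fs.readFileSync(path.join(rdfStDir, targetFile), 'utf8');
  return others.concat(targetSt).join('\n');
}
```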

RDF Generation

To produce RDF data through the ground truth semantic models, run the following commands:

$ node preparation/taheriyan2016/create_gt_rdf.js task_01
$ node preparation/taheriyan2016/create_gt_rdf.js task_03
$ node preparation/taheriyan2016/create_gt_rdf.js task_04

The RDF output files, built using the JARQL serialization of the semantic models, are available in the evaluation/taheriyan2016/{task}/semantic_models_gt/rdf folder. The RDF files created from the JARQL serialization of the semantic types are available in the evaluation/taheriyan2016/{task}/semantic_models_gt/rdf_st folder.

Background knowledge for the leave-one-out setting

To create the background knowledge in the leave-one-out setting, you can run the following commands:

$ node preparation/taheriyan2016/create_gt_background.js task_01
$ node preparation/taheriyan2016/create_gt_background.js task_03
$ node preparation/taheriyan2016/create_gt_background.js task_04

Split the background knowledge into training, test, and validation sets

To split the background knowledge into training, test, and validation sets, you can run the following commands:

$ python preparation/taheriyan2016/split_gt_background.py task_01
$ python preparation/taheriyan2016/split_gt_background.py task_03
$ python preparation/taheriyan2016/split_gt_background.py task_04
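Conceptually, the split amounts to shuffling the background triples and partitioning them; the sketch below illustrates the idea with hypothetical ratios and is not the actual logic of split_gt_background.py:

```javascript
// Illustrative train/validation/test split of background triples
// (hypothetical 80/10/10 ratios; not the actual split_gt_background.py logic).
const fs = require('fs');

function splitTriples(inputPath, trainRatio = 0.8, validRatio = 0.1) {
  const triples = fs.readFileSync(inputPath, 'utf8').trim().split('\n');
  // Fisher-Yates shuffle so the three sets are sampled uniformly.
  for (let i = triples.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [triples[i], triples[j]] = [triples[j], triples[i]];
  }
  const nTrain = Math.floor(triples.length * trainRatio);
  const nValid = Math.floor(triples.length * validRatio);
  return {
    train: triples.slice(0, nTrain),
    valid: triples.slice(nTrain, nTrain + nValid),
    test: triples.slice(nTrain + nValid),
  };
}
```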