Data Preparation
WARNING: this step has been already performed, so all data is available in the repository.
For the data preparation, you can follow the instructions available below.
Currently, SeMi can process only JSON files, because it adopts the JARQL engine for the serialization of the semantic model. For this reason, the CSV and XML files included in the Taheriyan data need to be converted into JSON format. To perform this conversion, you can run:
$ node preparation/taheriyan2016/convert_source_in_json.js task_01
$ node preparation/taheriyan2016/convert_source_in_json.js task_02
$ node preparation/taheriyan2016/convert_source_in_json.js task_03
$ node preparation/taheriyan2016/convert_source_in_json.js task_04
The output of this step is stored in the data/taheriyan2016/{task}/sources/original_json folder.
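The conversion itself is performed by the repository script above. As a purely illustrative sketch of the idea (turning tabular rows into an array of JSON objects), a minimal CSV-to-JSON conversion might look like the following; it ignores quoted fields and is not the repository's implementation:

```javascript
// Minimal CSV-to-JSON conversion sketch (no quoted-field handling).
// Each data row becomes one JSON object keyed by the header line.
function csvToJson(csvText) {
  const [headerLine, ...rows] = csvText.trim().split('\n');
  const headers = headerLine.split(',');
  return rows.map(row => {
    const values = row.split(',');
    return Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  });
}

const csv = 'artist,title\nMonet,Water Lilies\nKlimt,The Kiss';
console.log(JSON.stringify(csvToJson(csv), null, 2));
```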
The ground truth from the original Taheriyan data can be exploited to automatically create an initial version of the semantic types. To create these semantic types, you can run:
$ node preparation/taheriyan2016/create_gt_st.js task_01
$ node preparation/taheriyan2016/create_gt_st.js task_02
$ node preparation/taheriyan2016/create_gt_st.js task_03
$ node preparation/taheriyan2016/create_gt_st.js task_04
The output of this step is stored in the data/taheriyan2016/{task}/semantic_types/auto folder.
The semantic type extraction from the ground truth data shows that some attributes (or columnNodes, as defined by Taheriyan) are added to the semantic types to trigger specific parsing operations.
For instance, if an attribute of the source is called "artist" and we want to generate a URI from this field, a new node called "artistURI" is added among the semantic type nodes. Consequently, to maintain coherence between semantic types and source attributes, we need to enrich the source data with an attribute corresponding to the "artistURI" semantic type.
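The enrichment described above can be sketched as follows. Note that both the derived attribute name (field + "URI") and the URI value built here are illustrative assumptions, not the repository's actual URI scheme:

```javascript
// Sketch: enrich each source record with an extra attribute matching a
// URI-generating semantic type node (e.g. "artistURI").
// The "<field>URI" naming and the URI encoding are illustrative only.
function enrichWithUriField(records, field) {
  return records.map(r => ({
    ...r,
    [field + 'URI']: encodeURIComponent(String(r[field]).toLowerCase()),
  }));
}

const enriched = enrichWithUriField([{ artist: 'Claude Monet' }], 'artist');
console.log(enriched[0].artistURI); // "claude%20monet"
```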
Other issues are related to the conversion to JSON format. In particular, the conversion from XML files adds redundant fields to the JSON output. As reported above, special keywords have been introduced within the semantic types to trigger specific parsing operations: for instance, the keyword value has been injected by Taheriyan within the JSON files to enable the parsing of arrays of objects. For these reasons, the automatically generated semantic types require minor updates.
To address both these issues, we apply some changes that prepare and clean the data for SeMi ingestion and processing. To run the cleaning script, execute:
$ node preparation/taheriyan2016/clean.js task_01
$ node preparation/taheriyan2016/clean.js task_02
$ node preparation/taheriyan2016/clean.js task_03
$ node preparation/taheriyan2016/clean.js task_04
Updated sources are available in the data/taheriyan2016/{task}/sources/updated_json folder. Updated semantic types are available in the data/taheriyan2016/{task}/semantic_types/updated folder.
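One cleaning operation of the kind described above can be sketched as unwrapping the redundant single-key objects that an XML-to-JSON conversion may introduce. The key name "value" follows the keyword mentioned earlier; the traversal itself is an illustrative assumption, not the repository's cleaning logic:

```javascript
// Sketch: recursively unwrap redundant { value: ... } wrappers that an
// XML-to-JSON conversion may introduce around scalar fields.
function unwrapValues(node) {
  if (Array.isArray(node)) return node.map(unwrapValues);
  if (node && typeof node === 'object') {
    const keys = Object.keys(node);
    if (keys.length === 1 && keys[0] === 'value') return unwrapValues(node.value);
    return Object.fromEntries(keys.map(k => [k, unwrapValues(node[k])]));
  }
  return node;
}

console.log(unwrapValues({ artist: { value: 'Monet' }, works: [{ value: 'W1' }] }));
```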
WARNING: the generated JSON file s21-s-met.json is broken, so it is removed from the evaluation. As a consequence, the related semantic type file s21-s-met_st.json is removed too.
WARNING: the generated input JSON file data/taheriyan2016/task_04/sources/updated_json/tennesseegunexchange.json has been updated due to a bug in the generation of the refined semantic model. The change is available in the following commit: https://github.com/giuseppefutia/semi/commit/70441d4e29086c62f8307f0064908a5c15ab41b5. Since this bug is recurrent, the other sources of task_04 have also been manually updated: https://github.com/giuseppefutia/semi/commit/c18c18f6949301171b681b0956711c0a4790165b.
After the data cleaning step, it is possible to create the initial semantic models using SeMi. This task first generates a weighted multi-edge graph from the ontology structure; a Steiner tree detection algorithm is then applied to this graph to create the initial semantic model. You can run the following scripts:
$ node preparation/taheriyan2016/create_semi_sm.js task_01
$ node preparation/taheriyan2016/create_semi_sm.js task_02
$ node preparation/taheriyan2016/create_semi_sm.js task_03
$ node preparation/taheriyan2016/create_semi_sm.js task_04
This graph is serialized in three different ways:
- .json
- .dot
- .jarql
Details of these serializations can be explored in the examples available here.
The outputs of this step are available in the data/taheriyan2016/{task}/semantic_models/ folder and in the evaluation/taheriyan2016/{task}/semantic_models_steiner/ folder.
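The graph-and-Steiner-tree idea can be sketched as follows. This nearest-terminal heuristic (repeatedly attaching each terminal to the growing tree through its shortest path) is a common Steiner tree approximation, not necessarily the algorithm implemented in SeMi, and the toy ontology graph is invented for illustration:

```javascript
// Dijkstra over an adjacency list of [neighbor, weight] pairs.
function dijkstra(graph, src) {
  const dist = { [src]: 0 }, prev = {}, visited = new Set();
  while (true) {
    let u = null;
    for (const n of Object.keys(dist))
      if (!visited.has(n) && (u === null || dist[n] < dist[u])) u = n;
    if (u === null) break;
    visited.add(u);
    for (const [v, w] of graph[u] || [])
      if (dist[u] + w < (dist[v] ?? Infinity)) { dist[v] = dist[u] + w; prev[v] = u; }
  }
  return { dist, prev };
}

// Approximate a Steiner tree connecting the terminal nodes (here, the
// ontology classes matched by semantic types) via a nearest-path heuristic.
function steinerApprox(graph, terminals) {
  const tree = new Set([terminals[0]]), edges = [];
  for (const t of terminals.slice(1)) {
    const { dist, prev } = dijkstra(graph, t);
    const best = [...tree].reduce((a, b) =>
      (dist[a] ?? Infinity) < (dist[b] ?? Infinity) ? a : b);
    for (let v = best; v !== t; v = prev[v]) { edges.push([prev[v], v]); tree.add(v); }
    tree.add(t);
  }
  return edges;
}

// Toy weighted ontology graph (class names are illustrative).
const g = {
  Person: [['Artwork', 1], ['Place', 2]],
  Artwork: [['Person', 1], ['Place', 1]],
  Place: [['Person', 2], ['Artwork', 1]],
};
console.log(steinerApprox(g, ['Person', 'Place']));
```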
Now it is possible to serialize the ground truth semantic models in the SeMi format (JARQL) and, from these models, generate the RDF files.
This step has a twofold contribution:
- enabling the comparison of the semantic models created automatically by SeMi against the ground truth semantic models;
- creating the KGs to train and validate the deep learning model adopted for the link prediction.
To serialize the ground truth semantic models in the SeMi format (JARQL), you can run the following command:
$ node preparation/taheriyan2016/create_gt_sm.js task_01
$ node preparation/taheriyan2016/create_gt_sm.js task_02
$ node preparation/taheriyan2016/create_gt_sm.js task_03
$ node preparation/taheriyan2016/create_gt_sm.js task_04
The JARQL version of the ground truth semantic models is available in the evaluation/taheriyan2016/{task}/semantic_models_gt/jarql folder. This representation is directly exploited for the evaluation. You will also create a JARQL serialization of the semantic types, which is stored in the evaluation/taheriyan2016/{task}/semantic_models_gt/jarql_st folder. These semantic types will be integrated into the background knowledge for training purposes.
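To give a flavor of what a JARQL serialization looks like: JARQL expresses a mapping as a SPARQL CONSTRUCT query in which JSON keys are read as properties in the jarql: namespace. The fragment below is a hedged sketch of that idea; the prefixes, the ontology terms, and the URI pattern are invented for illustration and do not come from the repository's generated queries:

```sparql
PREFIX jarql: <http://jarql.com/>
PREFIX ex:    <http://example.org/ontology#>

CONSTRUCT {
  ?artworkURI a ex:Artwork ;
              ex:createdBy ?artistURI .
}
WHERE {
  # JSON keys of the source record are exposed as jarql: properties.
  ?record jarql:artist ?artist .
  BIND (URI(CONCAT('http://example.org/artist/', ?artist)) AS ?artistURI)
  BIND (URI(CONCAT('http://example.org/artwork/', ?artist)) AS ?artworkURI)
}
```

The actual generated queries can be inspected in the jarql folders listed above.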
WARNING: the current implementation of JARQL has problems with nested arrays. For this reason, we need to apply some manual fixes to the s19-s-indianapolis-artworks.query file in task_03 in order to create the background knowledge. These fixes are available in the following commit: https://github.com/giuseppefutia/semi/commit/e0c1def0e5c465d1399f7be21d9609d5d058f915?diff=unified. They do not impact the evaluation, but only the quality of the background knowledge used for training.
WARNING: we encountered issues in the generation of the background knowledge due to the URI generation (binding process) within the file s08-s-17-edited. Therefore, we applied some fixes, available in the following commit: https://github.com/giuseppefutia/semi/commit/e3da5896d0f2f9bd5a8a384fd2e97db4bc9cc235.
In order to train the link prediction model, we have to produce the background knowledge. For task_01, task_03, and task_04 we adopted a leave-one-out setting: given a target source for semantic modeling, the background knowledge is built from the other sources included in the dataset. As mentioned before, the background knowledge related to the target source also includes the target semantic types.
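The leave-one-out construction can be sketched as follows; the in-memory data layout is an illustrative assumption, since the repository script works on RDF files instead:

```javascript
// Sketch of the leave-one-out setting: the background knowledge for a
// target source is the concatenation of the RDF triples produced from
// all the OTHER sources (the target's semantic types, mentioned above,
// would be appended separately).
function leaveOneOutBackground(rdfBySource, target) {
  return Object.entries(rdfBySource)
    .filter(([source]) => source !== target)
    .flatMap(([, triples]) => triples);
}

const rdf = { s01: ['t1', 't2'], s02: ['t3'], s03: ['t4'] };
console.log(leaveOneOutBackground(rdf, 's02')); // triples from s01 and s03
```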
To produce RDF data through the ground truth semantic models, run the following commands:
$ node preparation/taheriyan2016/create_gt_rdf.js task_01
$ node preparation/taheriyan2016/create_gt_rdf.js task_03
$ node preparation/taheriyan2016/create_gt_rdf.js task_04
The RDF output files built using the JARQL serialization of the semantic models are available in the evaluation/taheriyan2016/{task}/semantic_models_gt/rdf folder. The RDF files created from the JARQL serialization of the semantic types are available in the evaluation/taheriyan2016/{task}/semantic_models_gt/rdf_st folder.
To create the background knowledge in the leave-one-out setting, you can run the following commands:
$ node preparation/taheriyan2016/create_gt_background.js task_01
$ node preparation/taheriyan2016/create_gt_background.js task_03
$ node preparation/taheriyan2016/create_gt_background.js task_04
To split the background knowledge into training, validation, and test sets, you can run the following commands:
$ python preparation/taheriyan2016/split_gt_background.py task_01
$ python preparation/taheriyan2016/split_gt_background.py task_03
$ python preparation/taheriyan2016/split_gt_background.py task_04
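The split step can be sketched as follows. The 80/10/10 ratio and the seedless shuffle are illustrative assumptions (the repository performs this step with the Python script above):

```javascript
// Sketch: shuffle the background triples and split them into
// training/validation/test sets. Ratios are illustrative defaults.
function splitTriples(triples, trainFrac = 0.8, validFrac = 0.1) {
  const shuffled = [...triples].sort(() => Math.random() - 0.5);
  const nTrain = Math.floor(shuffled.length * trainFrac);
  const nValid = Math.floor(shuffled.length * validFrac);
  return {
    train: shuffled.slice(0, nTrain),
    valid: shuffled.slice(nTrain, nTrain + nValid),
    test: shuffled.slice(nTrain + nValid),
  };
}

const split = splitTriples(Array.from({ length: 100 }, (_, i) => `triple_${i}`));
console.log(split.train.length, split.valid.length, split.test.length); // 80 10 10
```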