version 0.5!
dachafra committed Dec 23, 2020
1 parent 51d5e12 commit a6d13b9
Showing 3 changed files with 6 additions and 8 deletions.
2 changes: 1 addition & 1 deletion 5_evaluation/parameters.tex
@@ -394,4 +394,4 @@ \subsubsection*{Discussion of the Observed Results}

\subsection{Conclusions}
In this section, we performed an in-depth analysis of the variables and configurations that affect the behavior of two engines. The observation that existing engines behave heterogeneously whenever small changes are made to the testbeds motivated this study, which involves a set of parameters that can reveal patterns in the behavior of the studied engines. Additionally, the lack of testbeds encouraged us to define variables and configurations that enable the characterization of the pitfalls of existing engines and the identification of challenges and research directions in the state of the art.
With the proposed analysis and the results of the experimental study, we contribute an empirical configuration that can be reused for the evaluation of other knowledge graph creation tools and mapping languages (e.g. SPARQL-Generate, TARQL, or R2RML). Furthermore, our set of variables and configurations can be used as a guideline during testing and benchmarking. One of the main lessons learned during the definition and evaluation of our approach is that none of the evaluated engines behaves consistently as the complexity of the testbeds increases. Our ambition is that the reported results inspire the community to define general testbeds that facilitate the understanding of the state of the art and the development of novel tools for the creation of knowledge graphs at large scale. In the future, we plan to define testbeds and conduct a more detailed analysis of other engines and mapping languages. Moreover, we aim to motivate the community to join efforts in defining benchmarks that enable fair evaluations of knowledge graph creation tools with replicable and generalizable results.
With the proposed analysis and the results of the experimental study, we contribute an empirical configuration that can be reused for the evaluation of other knowledge graph creation tools and mapping languages (e.g., SPARQL-Generate, TARQL, or R2RML). Furthermore, our set of variables and configurations can be used as a guideline during testing and benchmarking. One of the main lessons learned during the definition and evaluation of our approach is that none of the evaluated engines behaves consistently as the complexity of the testbeds increases. Our ambition is that the reported results inspire the community to define general testbeds that facilitate the understanding of the state of the art and the development of novel tools for the creation of knowledge graphs at large scale. In the future, we plan to define testbeds and conduct a more detailed analysis of other engines and mapping languages. Moreover, we aim to motivate the community to join efforts in defining benchmarks that enable fair evaluations of knowledge graph creation tools with replicable and generalizable results.
9 changes: 3 additions & 6 deletions 8_conclusions/conclusions.tex
@@ -4,17 +4,14 @@ \chapter{Conclusions and Future work}


\section{Achievements}
Constructing knowledge graphs from heterogeneous data sources is a complex data integration problem. The open research problems addressed in this thesis are: (i) the generation and interoperability of different mapping rule specifications to obtain the desirable (virtual) KG, (ii) the creation of representative evaluation methods to provide an overview of the state of the art of KGC engines and to understand their current limitations, and (iii) optimization techniques to scale up the construction of both virtual and materialized KGs.
Constructing knowledge graphs from heterogeneous data sources is a complex data integration problem. The open research problems addressed in this thesis are: (i) the generation and interoperability of different mapping rule specifications to facilitate the KGC process for users, (ii) the creation of representative evaluation methods to provide an overview of the state of the art of KGC engines and to understand their current limitations, and (iii) optimization techniques to scale up the construction of both virtual and materialized KGs.


The first objective of this thesis focuses on defining \textbf{representative features of a new generation of knowledge graph construction systems}. This is done in Chapter \ref{chapter:mappig-translation}, where the \textit{mapping translation} concept is defined, adding a new layer to the KGC workflow. As we demonstrate with several use cases, making different mapping language specifications interoperable can enhance several steps of this process. The specific use case shown in this thesis is in the statistics domain, where we propose a set of new properties over the R2RML specification to improve the maintainability of the mapping rules in this domain. The ideas around this concept are also used in the optimizations presented in Chapters \ref{chapter:virtual} and \ref{chapter:construction}.

The exploitation of mapping rules to enhance the construction of virtual and materialized knowledge graphs techniques is

To accomplish the second objective of this thesis, described as \textbf{}


The \textbf{exploitation of mapping rules to enhance the construction of virtual and materialized knowledge graphs} is one of the main contributions of this thesis. To the best of our knowledge, the mapping-driven optimization techniques proposed in this work are the first to focus on and exploit information from the semantic annotations. The heuristic-based approaches proposed by Morph-CSV (Section \ref{chap6_morphgcsv}) and FunMap (Section \ref{chap7_funmap}) empirically demonstrate, over several benchmarks and use cases, the importance of declarative annotations in a KGC process for dealing efficiently with the heterogeneity of input data sources in the current web of data. Additionally, Morph-GraphQL (Section \ref{chap6_morphgraphql}) emphasizes the necessity of semantic web technologies, and more specifically of mapping rules, for avoiding data silos where non-semantic web approaches (e.g., GraphQL, REST APIs, etc.) are used to expose data on the web. Finally, SDM-RDFizer (Section \ref{chap7_rdfizer}) reveals the importance of well-designed physical data structures and their corresponding operators for scaling up the construction of knowledge graphs. In summary, we have identified the limitations and open problems of state-of-the-art proposals and tackled them from a research perspective, highlighting that engineering solutions are not enough to solve the complex data integration problem of constructing knowledge graphs.

To accomplish the second objective of this thesis, described as \textbf{representative evaluation systems for knowledge graph construction engines from heterogeneous data sources}, we present three different contributions. First, we analyze and extend the test cases presented for RDB2RDF engines to cover heterogeneous data sources, using RML as the mapping language. In this manner, we provide an overview of the compliance of the engines with this mapping language, which helps users and practitioners select a specific engine for their use cases. Second, we select and analyze the parameters that can affect the performance and completeness of KGC engines. Our ambition is that the reported results of this contribution inspire the community to define general testbeds that facilitate the understanding of the state of the art and the development of novel tools for constructing knowledge graphs at large scale. Following this ambition, we define the GTFS-Madrid-Bench, a benchmark for (virtual) KGC engines in the transport domain. Integrating the parameters defined in our previous work and defining a set of representative SPARQL queries, we propose the first benchmark for evaluating, in a representative manner, virtual KGC engines over one or multiple data sources and formats. We empirically test our approach over a set of heterogeneous KGC engines and identify multiple promising lines of future research on this topic. Although the first and second contributions have been tested over materialized KGC engines and the third one over virtual KGC engines, note that the contributions of this thesis are agnostic to the type of process performed and can be used to test the capabilities of both approaches.


\section{Future Work}
3 changes: 2 additions & 1 deletion appendix/appendix1.tex
@@ -104,8 +104,9 @@ \chapter{GTFS-Madrid-Bench Completeness}
}
\end{table}

\newpage

\begin{table}[th]
\begin{table}[ht]
\centering
\caption[Completeness GTFS-100]{Completeness of benchmark queries in experiment configurations with the GTFS-100 dataset. A minus sign means that the processor is not able to execute the query (i.e., it generates an error) or does not evaluate the query within the timeout duration.}
\label{tab:results5}
