This repository provides the datasets used in the Provenance Network Analytics paper and the code for its analyses; the same code generated the charts shown in the paper. Please note that the information provided here is meant to accompany the paper, where the analytic method is described in more detail.
Provenance network analytics is a novel data analytics approach that helps infer properties of data, such as quality or trustworthiness, from their provenance. Instead of analysing application data, which are typically domain-dependent, it analyses the data's provenance as represented using the World Wide Web Consortium's domain-agnostic PROV data model. Specifically, the approach proposes a number of provenance network metrics (PNM) and applies machine learning techniques over such metrics to build predictive models for some key properties of data. Applying this method to the provenance of real-world data from three different applications, we show that provenance network analytics can successfully identify the owners of provenance documents, assess the trustworthiness of crowdsourced data, and identify instructions from chat messages in an alternate-reality game with high levels of accuracy.
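To make the approach concrete, below is a minimal sketch of computing generic network metrics over a toy provenance graph with networkx. It is for illustration only: the node and relation names are made up, and the actual set of PNM is defined in the paper, not here.

```python
import networkx as nx

# A toy provenance graph: nodes stand for PROV entities, activities,
# and agents; edges stand for PROV relations. All names are made up.
g = nx.DiGraph()
g.add_edge("report", "writing", label="wasGeneratedBy")
g.add_edge("writing", "draft", label="used")
g.add_edge("report", "alice", label="wasAttributedTo")

# Generic graph metrics of the kind the PNM build on.
metrics = {
    "nodes": g.number_of_nodes(),
    "edges": g.number_of_edges(),
    "density": nx.density(g),
    # Provenance graphs are rarely strongly connected, so we measure
    # the diameter on the undirected version of the graph.
    "diameter": nx.diameter(g.to_undirected()),
}
print(metrics)
```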
The notebooks and the accompanying datasets provided in this repository demonstrate how the method can be applied in a number of domains as a useful and generic tool for data analytics.
You do not need to install anything to view the notebooks provided in this repository (linked below). However, if you want to re-run the code on the datasets, you will need to install the required Python packages listed in `requirements.txt`, as shown below.
The code provided with the datasets was run on Python 3.6; it might run on other Python versions, but this is not guaranteed. All the packages required to run the experiments are listed in `requirements.txt`. To install them, run the following command with `pip`:
```
pip install -r requirements.txt
```
We use three datasets in our paper, which are listed below. Each dataset contains a number of provenance graphs and their labels. Due to privacy constraints, instead of providing the actual provenance graphs, we provide only the provenance network metrics calculated from them, which are what our analyses use. (A short example of loading these files follows the list below.)
- Provenance documents on ProvStore:
  - `provstore/data.csv`: the PNM of provenance documents uploaded to ProvStore and their corresponding owners (anonymised as `u_1`, `u_2`, ...)
- Provenance of CollabMap data:
  - `collabmap/trust_values.csv`: the trust value of each data entity from CollabMap (identified by the `id` column).
  - `collabmap/depgraphs.csv`: the PNM of the provenance dependency graph of each data entity. (See our paper for the definition of a provenance dependency graph.)
  - `collabmap/ancestor-graphs.csv`: the PNM of the (historical) provenance graph of each data entity, i.e. the graph recording how it was generated.
- Provenance from the Radiation Response Game (RRG):
  - `rrg/depgraphs-k.csv`, e.g. `rrg/depgraphs-5.csv`: the PNM of the level-k provenance dependency graph of an RRG chat message (k = 1..18).
  - `rrg/depgraphs.csv`: the PNM of the full dependency graph of an RRG chat message (i.e. without restricting the dependency graph to k edges away from the message entity).
  - `rrg/ancestor-graphs.csv`: the PNM of the (historical) provenance graphs of the messages.
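As a minimal example of working with these files, the sketch below loads the CollabMap metrics with pandas and joins them with the trust values. It assumes that `depgraphs.csv` also carries the `id` column documented above for `trust_values.csv`; check the CSV headers if in doubt.

```python
import pandas as pd

# Load the PNM of the CollabMap dependency graphs and the trust value
# of each data entity, then join the two tables on the entity identifier.
depgraphs = pd.read_csv("collabmap/depgraphs.csv")
trust = pd.read_csv("collabmap/trust_values.csv")

# Assumption: both files share the `id` column.
df = depgraphs.merge(trust, on="id")
print(df.shape)
print(df.head())
```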
The notebooks below provide the code for the analysis of the above datasets as reported in our paper. They detail the steps we took in our experiments and also show their results.
- Application 1: Identifying the owner of a provenance document
- Application 2: Assessing the trustworthiness of crowdsourced data in CollabMap
- Application 3: Identifying instructions from chat messages in the Radiation Response Game
In addition, we provide extra materials to help with replicating the experiments and to document extra experiments we carried out, which are not included in the paper due to space constraints.
- Common cross validation test code: explaining our evaluation method as implemented in `analytics.py` and used in the three notebooks above (see the sketch after this list).
- Extra 1 - Comparing machine learning algorithms: we compared the performance of a number of classifiers provided by the scikit-learn package over our datasets in terms of accuracy and run time.
- Extra 2: we compare the performance of decision tree classifiers on unbalanced datasets vs. balanced ones. Note that we did not balance the data in Application 3 as they are already fairly balanced.
- Extra 3: we apply our provenance network analytics method to the historical provenance of data, i.e. the provenance recording how the data was produced, instead of the dependency graphs of data (the forward provenance) used in Application 2 and Application 3. Since Application 1 looks at the provenance graphs of whole provenance documents, this experiment is not applicable to it.
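For a flavour of the evaluation, here is a minimal sketch of a cross-validated classification run over one of the datasets. It is not the code in `analytics.py`: the `label` column name is an assumption made for illustration, so check the notebooks and the CSV headers for the actual feature and label columns.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Join the PNM features with the data property to predict (here, trust).
df = pd.read_csv("collabmap/depgraphs.csv").merge(
    pd.read_csv("collabmap/trust_values.csv"), on="id"
)

# Assumption: the trust label lives in a column named `label`.
X = df.drop(columns=["id", "label"])  # the PNM features
y = df["label"]                       # the property to predict

# 10-fold cross-validation with a decision tree classifier.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"Mean accuracy: {scores.mean():.2%} (std: {scores.std():.2%})")
```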