Luca Gregori edited this page May 23, 2023 · 2 revisions

Overview

DPDS (Data Provenance for Data Science) is a library that captures fine-grained provenance in a preprocessing pipeline. Built on top of pandas, DPDS captures provenance through a clean interface that requires no additional function calls. The data scientist simply implements the pipeline and, after executing it, can analyze the corresponding provenance graph using Neo4j.
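
As an illustration of the kind of preprocessing pipeline meant here, the following is a minimal sketch in plain pandas. It does not use any DPDS-specific API, since the import and setup DPDS requires are not shown on this page. The datasets, column names, and the operation labels in the comments are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical input data, made up for this example.
people = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34.0, None, 29.0, 51.0],
    "city": ["Rome", "Milan", None, "Turin"],
    "notes": ["a", "b", "c", "d"],
})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [30000, 42000, 55000]})

# Remove a feature (column).
df = people.drop(columns=["notes"])
# Remove records (rows) that have no city.
df = df.dropna(subset=["city"])
# Fill missing ages with the mean of the remaining ages.
df = df.assign(age=df["age"].fillna(df["age"].mean()))
# Rename a feature.
df = df.rename(columns={"city": "town"})
# Add a derived feature.
df = df.assign(age_months=df["age"] * 12)
# Combine with a second dataset on a common key.
df = df.merge(salaries, on="id", how="inner")

print(df)
```

Each step is an ordinary pandas call, which is the style of code the overview describes as being tracked without extra function calls.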

Currently, the types of functions captured are as follows:

| Category | Function | Description | Examples |
|---|---|---|---|
| Data Reduction | Feature Selection | One or more features are removed. | |
| Data Reduction | Instance Drop | One or more records are removed. | |
| Data Augmentation | Feature Augmentation | One or more features are added. | |
| Data Augmentation | Instance Generation | One or more records are added. | |
| Space Transformation | Dimensionality Reduction | Features and records are added/removed. The overall number of removed features and records is greater than those added. | |
| Space Transformation | Space Augmentation | Features and records are added/removed. The overall number of added features and records is greater than those removed. | |
| Space Transformation | Space Transformation | Features and records are added/removed. In this case, there can be a reduction in dimensionality for one axis and a space augmentation for the other. | |
| Data Transformation | Value Transformation | The values of one or more features are transformed. | Examples |
| Data Transformation | Imputation | Missing values in one or more features are filled with estimated values. | |
| Feature Manipulation | Feature Rename | One or more features are renamed. | |
| Data Combination | Join | Two or more datasets are combined based on a common attribute or key. | |
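
To make the shape-based categories above more concrete, here is a rough sketch of how a change in columns and rows could be mapped to one of those names. This is not DPDS's implementation: the function `classify_shape_change`, its tie-breaking rules, and the labels "Unclassified" and "No Structural Change" are assumptions made for illustration. In particular, the table does not say which rule wins when one axis shrinks and the other grows, so the sketch treats that case as Space Transformation.

```python
def classify_shape_change(cols_before, cols_after, rows_before, rows_after):
    """Name the structural change between two dataset states (illustrative only)."""
    cb, ca = set(cols_before), set(cols_after)
    rb, ra = set(rows_before), set(rows_after)
    col_added, col_removed = len(ca - cb), len(cb - ca)
    row_added, row_removed = len(ra - rb), len(rb - ra)
    cols_changed = col_added + col_removed > 0
    rows_changed = row_added + row_removed > 0

    if cols_changed and rows_changed:
        col_net = col_added - col_removed
        row_net = row_added - row_removed
        # Assumption: one axis shrinking while the other grows is "mixed".
        if col_net * row_net < 0:
            return "Space Transformation"
        added = col_added + row_added
        removed = col_removed + row_removed
        if removed > added:
            return "Dimensionality Reduction"
        if added > removed:
            return "Space Augmentation"
        return "Space Transformation"

    if cols_changed:
        if col_removed and not col_added:
            return "Feature Selection"
        if col_added and not col_removed:
            return "Feature Augmentation"
        return "Unclassified"  # columns both added and removed; not covered by the table

    if rows_changed:
        if row_removed and not row_added:
            return "Instance Drop"
        if row_added and not row_removed:
            return "Instance Generation"
        return "Unclassified"  # rows both added and removed; not covered by the table

    return "No Structural Change"


# Example: one column and two rows removed.
label = classify_shape_change(["a", "b"], ["a"], [0, 1, 2], [0])
print(label)
```

With pandas, the four arguments could be taken from `df.columns` and `df.index` of the DataFrame before and after an operation. Value-level categories such as Value Transformation and Imputation cannot be detected from shape alone and are outside the scope of this sketch.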

Getting Started

Requirements

  • neo4j >= 5.7.x
  • pandas 1.5.0

For further details, refer to the requirements.txt file.

Activate venv (sh/bash)

To create a new virtual environment (venv), use the guide at the following link.

source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Install Neo4j via Docker (Extra)

It is recommended to install Docker using the official guide at the following link.

To change the options related to the Neo4j Docker image, modify the file neo4j/docker-compose.yml.

Start Neo4j in background:

cd neo4j
docker compose up -d

Stop Neo4j:

cd neo4j
docker compose down

Default credentials:

  • User: neo4j
  • Password: admin

To access the Neo4j web interface: