Welcome to the supplementary material for the paper:
- Edouard Fouché, Alan Mazankiewicz, Florian Kalinke and Klemens Böhm. 2020. A framework for dependency estimation in heterogeneous data streams. Distributed and Parallel Databases, June 6, 2020. 30 pages. https://doi.org/10.1007/s10619-020-07295-x
This repository contains the original implementation of MCDE with estimators MWP, KSP and CSP, and the information to reproduce the experiments in the paper. For this reason, it is partially frozen at the time of publication.
This repository is released under the AGPLv3 license. Please see the LICENSE.md file.
See also the following repositories: MCDE, MCDE-experiments, as well as the following publication:
- Edouard Fouché & Klemens Böhm. 2019. Monte Carlo Dependency Estimation. In 31st International Conference on Scientific and Statistical Database Management (SSDBM ’19), July 23–25, 2019, Santa Cruz, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3335783.3335795
Requirements: Oracle JDK 8 or OpenJDK 8, and sbt.
The project is built with sbt (version 1.2.8). You can compile, package or run the project as follows:
sbt compile
sbt package
sbt "run <arguments>"
You can also export a "fat" jar, including all dependencies and Scala libraries, using sbt-assembly:
sbt assembly
This creates a jar in the folder target/scala-2.12/ named MCDE-EXTENDED-<version>.jar, which can be run with java (no sbt/Scala installation required). The version of the package at the time of the experiments is 0.1.0.
Once you have built the jar, you can run it as follows:
java -jar target/scala-2.12/MCDE-EXTENDED-0.1.0.jar <arguments>
You may find examples of usage of the estimators in src/scala/io/githubedouardfouche/worksheets/user_guide.sc
In this section, we explain how to reproduce the experiments from our paper.
The experiments create about 1.7 GB of data and require about 5 days on a server with 64 cores at 3.0 GHz and 128 GB RAM,
using OpenJDK 8 and Scala 2.12.8. Results are saved in the folder experiments as .csv files, along with logs.
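To get a quick overview of what a run has produced, the following minimal Python sketch lists every .csv file under the experiments folder with its number of data rows. It assumes only what is stated above (results are .csv files under experiments); the sub-folder layout and the presence of a header row in each file are assumptions, and the function name summarize_results is hypothetical.

```python
import csv
from pathlib import Path


def summarize_results(root="experiments"):
    """Return {relative csv path: number of data rows} for every .csv under root."""
    root_path = Path(root)
    summary = {}
    # rglob walks sub-folders too, since the exact layout is not assumed.
    for csv_path in sorted(root_path.rglob("*.csv")):
        with csv_path.open(newline="") as handle:
            n_lines = sum(1 for _ in csv.reader(handle))
        # Assumption: the first row of each file is a header.
        summary[csv_path.relative_to(root_path).as_posix()] = max(n_lines - 1, 0)
    return summary


if __name__ == "__main__":
    for name, n_rows in summarize_results().items():
        print(f"{name}: {n_rows} rows")
```

If the folder does not exist yet, the function simply returns an empty dictionary.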
Evaluate the distribution of contrast values of MCDE (MWP, KSP, CSP and variants) against continuous,
categorical and ordinal data. This experiment generates a large amount of data; to reduce it, lower the number of repetitions.
See io.github.edouardfouche.experiments.Contrast.
sbt "run com.edouardfouche.experiments.Contrast" # ~ 0.5 hour, 1.5GB data
Evaluate the statistical power of MCDE (MWP, KSP, CSP and variants) against a panel of dependencies.
sbt "run com.edouardfouche.experiments.Power" # ~ 5 hours, 2MB data
Evaluate the statistical power of MCDE (MWP, KSP, CSP and variants) against a panel of discrete linear dependencies.
sbt "run com.edouardfouche.experiments.PowerDiscrete" # ~ 2 hours, 17MB data
Evaluate the performance (runtime) of the proposed index structure w.r.t. increasing window size.
sbt "run com.edouardfouche.experiments.PerformanceIndex" # ~ 13 hours, 51MB data
Evaluate the performance (runtime) of contrast measures w.r.t. increasing window size. This experiment takes a long time, but it can be interrupted before it finishes without much impact on the quality of the results.
sbt "run com.edouardfouche.experiments.PerformanceContrast" # ~ 3 days, 58MB data
Mine the contrast for numerous pairs for our use case (Bioliq).
sbt "run com.edouardfouche.experiments.StreamMiner" # ~ 3 hours, 8MB data
Estimate the quality of contrast monitoring w.r.t. increasing number of iterations M and step size.
sbt "run com.edouardfouche.experiments.StreamEstimatorPerformance" # ~ 21 hours, 42MB data
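The experiments above can also be launched one after another from a small driver script. The sketch below is one possible way to do this in Python, not part of the repository: the class names are copied from the sbt commands above, while the names EXPERIMENTS, build_commands and run_all are hypothetical. By default it only prints the commands (dry run), since the full sequence takes several days.

```python
import shlex
import subprocess

# Package and class names as they appear in the sbt commands above.
PACKAGE = "com.edouardfouche.experiments"
EXPERIMENTS = [
    "Contrast",
    "Power",
    "PowerDiscrete",
    "PerformanceIndex",
    "PerformanceContrast",
    "StreamMiner",
    "StreamEstimatorPerformance",
]


def build_commands(names=EXPERIMENTS):
    """Build one argument list per experiment, equivalent to: sbt "run <class>"."""
    return [["sbt", f"run {PACKAGE}.{name}"] for name in names]


def run_all(names=EXPERIMENTS, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in build_commands(names):
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the sequence if an experiment fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling run_all(["Power", "PowerDiscrete"], dry_run=False) would run only the two shorter power experiments, assuming sbt is on the PATH and the script is started from the project root.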
We attach the data that we used for our case study (a sample of data produced by the Bioliq power plant) in the root of this repository.
See bioliq_1w_MCDE.7z. Simply decompress the archive and place the .csv file into <HOME_FOLDER>/data/bioliq/.
Adjust the hard-coded path in io.github.edouardfouche.experiments.Data if necessary.
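Before starting the Bioliq experiments, it can help to confirm that the .csv file is where the code expects it. The following Python sketch does this under one explicit assumption: that <HOME_FOLDER> is the current user's home directory unless another folder is passed in. The function name find_bioliq_csv is hypothetical, and the actual path used by the experiments is the one hard-coded in io.github.edouardfouche.experiments.Data.

```python
from pathlib import Path


def find_bioliq_csv(home_folder=None):
    """Return the sorted .csv files found in <home_folder>/data/bioliq/."""
    # Assumption: <HOME_FOLDER> defaults to the user's home directory.
    base = Path(home_folder) if home_folder is not None else Path.home()
    data_dir = base / "data" / "bioliq"
    return sorted(data_dir.glob("*.csv"))


if __name__ == "__main__":
    found = find_bioliq_csv()
    if found:
        for path in found:
            print(f"found {path}")
    else:
        print("no .csv file found in data/bioliq under the home folder")
```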
Note that we plan to release more data from the Bioliq plant in the future.
Then, you can use the Jupyter notebooks in the folder visualize to reproduce the plots from the publication. At the time of the experiments, we used the following Python packages:
# Name Version
matplotlib 2.0.2
numpy 1.18.1
pandas 0.25.3
seaborn 0.9.0
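Newer versions of these packages may render the plots differently, so it can be useful to compare the local environment against the versions listed above. The sketch below does this with the standard library's importlib.metadata (Python 3.8 or later); the versions are copied from the list above, and the names PINNED and compare_versions are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions used at the time of the experiments, copied from the list above.
PINNED = {
    "matplotlib": "2.0.2",
    "numpy": "1.18.1",
    "pandas": "0.25.3",
    "seaborn": "0.9.0",
}


def compare_versions(pinned=PINNED, lookup=version):
    """Return {package: (expected, installed or None, versions match)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed
        report[name] = (expected, installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in compare_versions().items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {expected}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the notebooks will fail; it only flags a difference from the original setup.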
We welcome contributions to the repository and bug reports on GitHub.
For questions and comments, please contact edouard.fouche@kit.edu, or open an issue.
- We propose a standalone, deployment-ready version of MCDE in this repository and previous experiments here.
- We developed a data generator for these experiments, which we released independently here.
- This work was supported by the DFG Research Training Group 2153: ‘Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation’ and the German Federal Ministry of Education and Research (BMBF) via Software Campus (01IS17042). We thank the pyrolysis team of the Bioliq process for providing the data for our real-world use case (see also https://www.bioliq.de).