Independent Laboratory Work - Epico

This project was created for my master's independent laboratory work. Its aim was to build a tool that lets me run custom simulations on datasets newly created by my own random dataset generator. From the simulation results, I try to conclude how each chosen machine learning model behaves when the environment (the dataset) changes.

1. Why did I create a custom Random Dataset Generator?

I needed to create the generator because:

- With only one dataset, the simulations would not produce any unforeseen results.
- I needed a tool that guarantees randomness in value creation, is easily customizable, and can generate a large number of datasets for my simulations.

2. How did I create the Random Dataset Generator?

First of all, to guarantee randomness, I used Monte Carlo sampling, which lets me draw values at random. In C++ this can be achieved with the Mersenne Twister pseudo-random engine (available in the standard library as, for example, std::mt19937).

Secondly, to be able to customize the dataset structure, I implemented several distributions, which can be seen in the following table.

| Name of distribution | Parameters |
| --- | --- |
| Binomial | number of trials, probability, weight |
| Bernoulli | probability, weight |
| Normal | mean, standard deviation, weight |
| Uniform Discrete | from, to, weight |
| Uniform Real | from, to, weight |
| Gamma | alpha, beta, weight |
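To illustrate the idea, here is a minimal Python sketch of seeded sampling from the six distributions in the table. It is not the C++ implementation from epico-cpp: the names `make_samplers`, `generate_columns`, and the `spec` layout are hypothetical, and the `weight` parameter is left out because it is assumed to matter only when the output column is computed. Python's `random.Random` also uses the Mersenne Twister, so a fixed seed gives reproducible datasets.

```python
import random


def make_samplers(rng):
    """Map each distribution name to a function that draws one value."""
    return {
        "bernoulli": lambda p: 1 if rng.random() < p else 0,
        # A binomial draw is the number of successes in n Bernoulli trials.
        "binomial": lambda n, p: sum(rng.random() < p for _ in range(n)),
        "normal": lambda mean, sd: rng.gauss(mean, sd),
        # randint includes both endpoints.
        "uniform_discrete": lambda lo, hi: rng.randint(lo, hi),
        "uniform_real": lambda lo, hi: rng.uniform(lo, hi),
        # alpha is the shape and beta the scale parameter.
        "gamma": lambda alpha, beta: rng.gammavariate(alpha, beta),
    }


def generate_columns(spec, n_rows, seed):
    """Build {column name: list of n_rows values} from a column spec."""
    rng = random.Random(seed)  # Mersenne Twister under the hood
    samplers = make_samplers(rng)
    return {
        name: [samplers[dist](*params) for _ in range(n_rows)]
        for name, (dist, params) in spec.items()
    }


# Hypothetical dataset layout: column name -> (distribution, parameters).
spec = {
    "x1": ("binomial", (10, 0.3)),
    "x2": ("bernoulli", (0.5,)),
    "x3": ("normal", (0.0, 1.0)),
    "x4": ("uniform_discrete", (1, 6)),
    "x5": ("uniform_real", (0.0, 2.0)),
    "x6": ("gamma", (2.0, 1.5)),
}
columns = generate_columns(spec, n_rows=5, seed=42)
```

Changing only the seed yields a new dataset with the same structure, which is what makes it cheap to produce many datasets for the simulations.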

Lastly, to be able to use binary classification on the generated datasets, I needed a binary output column. This can be achieved with the logit function of logistic regression.
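One way this step could work is sketched below in Python. The sketch rests on assumptions that the text above does not confirm: that the `weight` of each distribution acts as the coefficient of its column, that an intercept exists (the `intercept` argument is hypothetical), and that the label is drawn as a Bernoulli variable from the resulting probability rather than obtained by thresholding. The weighted sum is treated as the log-odds (the logit) and mapped back to a probability with the logistic function.

```python
import math
import random


def sigmoid(z):
    """Inverse of the logit: map log-odds z to a probability in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this branch avoids overflow for very negative z
    return ez / (1.0 + ez)


def binary_outcome(row, weights, intercept, rng):
    """Draw a 0/1 label for one row of covariate values."""
    # Assumption: the weighted sum of the covariates is the log-odds.
    z = intercept + sum(w * x for w, x in zip(weights, row))
    p = sigmoid(z)
    return 1 if rng.random() < p else 0


rng = random.Random(42)
rows = [[1.0, 0.0, 2.5], [0.0, 1.0, -1.2], [3.0, 1.0, 0.4]]
weights = [0.8, -1.5, 0.3]  # hypothetical per-column weights
labels = [binary_outcome(r, weights, intercept=-0.5, rng=rng) for r in rows]
```

Under this reading, a larger absolute weight makes a covariate more influential on the output, which is the kind of effect the simulations are meant to detect.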

3. Why did I create custom simulations?

The goals of creating custom simulations were:

- to scale up the number of simulations used during my work
- to create a plug-and-play solver that can be easily customized for my needs
- to see how each covariate influences the performance of the machine learning models used

4. Types of custom simulations

Without column exclusion:
- measures the influence of all covariates together
- fits the machine learning model on all covariates
With column exclusion:
- measures the influence of each covariate separately
- excludes one column in each iteration
- fits the machine learning model on the remaining dataset
- puts the excluded column back at the end of the iteration
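The two simulation types can be sketched as follows. This is a simplified Python illustration, not the code in epico-python: `fit_and_score` is a hypothetical callable that fits a model on the given columns and returns a score, and the stand-in scorer at the bottom exists only so that the example runs. Each iteration builds a reduced copy of the dataset, so the original columns are never modified and "putting back" the excluded column is implicit.

```python
def run_without_exclusion(columns, target, fit_and_score):
    """Fit once on all covariates and return the single score."""
    return fit_and_score(columns, target)


def run_with_exclusion(columns, target, fit_and_score):
    """Fit once per covariate, each time leaving that covariate out."""
    scores = {}
    for excluded in columns:
        # Reduced dataset: every column except the excluded one.
        remaining = {k: v for k, v in columns.items() if k != excluded}
        scores[excluded] = fit_and_score(remaining, target)
    return scores


# Stand-in scorer for demonstration only: reports how many columns it saw.
def count_columns(cols, target):
    return len(cols)


columns = {"x1": [1, 0, 1], "x2": [0.2, 0.9, 0.4], "x3": [5, 3, 4]}
target = [1, 0, 1]
baseline = run_without_exclusion(columns, target, count_columns)
per_column = run_with_exclusion(columns, target, count_columns)
```

Comparing each entry of `per_column` with `baseline` shows how much the score changes when a given covariate is missing.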

5. Methodology of research

5 scenarios
4 machine learning models:
- Logistic Regression with default parameters
- Random Forest with default parameters
- Random Forest with hyperparameters optimized for accuracy
- Random Forest with hyperparameters optimized for the AUC value of the ROC analysis
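The two optimization targets named above, accuracy and AUC, measure different things. The following Python sketch shows one standard way to compute each; it is an illustration only, since the project itself may rely on a library implementation. The AUC is computed here from its pairwise definition: the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed from score comparisons."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count pairs where the positive outranks the negative; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]          # predicted probabilities
y_pred = [int(s >= 0.5) for s in scores]  # labels at a 0.5 threshold
acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold, while AUC depends only on how the scores rank the examples, so hyperparameters tuned for one metric need not be the best for the other.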

6. Results

The documentation can be found in the docs folder.

7. Structure of the project

- The epico-cpp folder contains the implementation of the Random Dataset Generator.
- The epico-python folder contains the implementation of the custom simulations and the data visualization files.
