Distributed Aggregate Variance

This repo contains the code and documentation for the project of the course 2AMD15 - Big Data Management, held at TU/e by Prof. Odysseas Papapetrou during the academic year 2022/2023.

The goal of the project is to use the Spark platform to discover interesting combinations of vectors in a large dataset, in a distributed cloud setting. More details are in the problem description.

Our results are summarized in the poster and described in depth in the report.

Outline

Problem description	Report	Poster

How to use

Run locally

To run the project on your machine:

1. Generate a vector.csv dataset with the GenVec.jar utility (see here).
2. Run python3 ./src/2amd15/main.py

Submit to server

To prepare a submission, an app.zip archive must be uploaded to the server; it should contain a single main.py file. The data should be contained in a data.zip archive. Only one question can be uploaded and tested per submission.
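As a rough sketch of what these two artifacts look like, the snippet below builds them with Python's standard zipfile module. It is not the project's submit.py: the helper name build_artifacts, the output directory, and the choice to store the dataset at the root of data.zip are assumptions made for illustration.

```python
import zipfile
from pathlib import Path
from typing import Tuple


def build_artifacts(main_py: Path, dataset: Path, out_dir: Path) -> Tuple[Path, Path]:
    """Create app.zip (holding only main.py) and data.zip (holding the dataset).

    Hypothetical helper: the names and archive layout are assumptions,
    not the behaviour of the project's submit.py.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    app_zip = out_dir / "app.zip"
    data_zip = out_dir / "data.zip"

    # app.zip holds exactly one entry, stored as main.py at the archive root.
    with zipfile.ZipFile(app_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(main_py, arcname="main.py")

    # data.zip holds the dataset; placing it at the archive root is assumed.
    with zipfile.ZipFile(data_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(dataset, arcname=dataset.name)

    return app_zip, data_zip
```

Calling build_artifacts(Path("src/2amd15/main.py"), Path("vector.csv"), Path("dist")) would then leave dist/app.zip and dist/data.zip ready for upload, assuming those input paths exist.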

The submit.py module takes care of preparing these submission artifacts. Here is how to use it:

usage: submit.py [-h] [-s] -q {2,3,4} [-p PASSWORD] [-v] [-r ROWS] [-c COLS] [-f]

Handles the building pipeline of the submission artifact.

options:
  -h, --help            show this help message and exit
  -s, --submit          Proceed with creating a submission on the server after build
  -q {2,3,4}, --question {2,3,4}
                        which question is the submission artifacts about
  -p PASSWORD, --password PASSWORD
                        password of the server, required if -s is passed
  -v, --verbose         verbose flag, sets logging level to debug
  -r ROWS, --rows ROWS  number of vectors in the csv, overwrites default
  -c COLS, --columns COLS
                        length of vectors in the csv, overwrites default
  -f, --full-build      builds a main.py with full code
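For readers who want to see how such an interface is typically declared, here is a minimal argparse sketch reconstructed from the help text above. It is an approximation rather than the actual submit.py source: the int types, the absence of defaults, and the function name build_parser are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a parser mirroring the options listed in the help text.

    Reconstruction for illustration only; argument types are assumed.
    """
    parser = argparse.ArgumentParser(
        prog="submit.py",
        description="Handles the building pipeline of the submission artifact.",
    )
    parser.add_argument("-s", "--submit", action="store_true",
                        help="create a submission on the server after build")
    parser.add_argument("-q", "--question", type=int, choices=[2, 3, 4],
                        required=True, help="question the artifacts are built for")
    parser.add_argument("-p", "--password",
                        help="password of the server")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set logging level to debug")
    parser.add_argument("-r", "--rows", type=int,
                        help="number of vectors in the csv")
    parser.add_argument("-c", "--columns", type=int, metavar="COLS",
                        help="length of vectors in the csv")
    parser.add_argument("-f", "--full-build", action="store_true",
                        help="build a main.py with full code")
    return parser


# Parse an explicit argument list instead of sys.argv, so this runs anywhere.
args = build_parser().parse_args(["-q", "3", "-s", "-r", "250"])
print(args.question, args.submit, args.rows)
```

Options that are not passed, such as -c, are left as None in this sketch, which a caller could use to fall back to its own defaults.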

For instance, here is how to prepare the artifacts and submit them for question 3 with a reduced dataset of 250 vectors:

Unix

python3 ./tools/submit.py -q 3 -s -r 250

Win

python .\tools\submit.py -q 3 -s -r 250

Connect to server

You can connect to the server via SFTP by running the following script and then typing in the password:

Unix

./tools/server/connect.sh

Win

.\tools\server\connect.bat

Template information

Description of the project template from the lecturer

This is a template project for the team project of 2AMD15 2023.

Various comments in the main.py file indicate where functionality needs to be implemented. These comments are marked TODO.

You are allowed to change the layout of the main.py file (such as method names and signatures) in any way you like, and you can add new functionality (possibly in new .py files). Do make sure that there exists a file called "main.py" which contains the 'main module statement' of the form "if __name__ == '__main__':".
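The 'main module statement' is Python's standard entry-point guard. A minimal main.py skeleton containing it could look like the following; the function name main and its placeholder body are illustrative and not taken from the template.

```python
def main() -> None:
    # Placeholder: the project's actual logic would be called from here.
    print("running main")


# This guard is true only when the file is executed directly
# (python3 main.py), not when it is imported as a module.
if __name__ == "__main__":
    main()
```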

You can ZIP the main.py file along with any other .py files using any common compression tool. The archive should be called app.zip and can be uploaded as such to the server.

Good luck on the project!

