Project description and milestones
Datarun's goal is to train and test machine learning models. It is a REST API written in Django.
It is designed to work with databoard.
Below are the main features of databoard and datarun, since the two are designed to work together for now.
Databoard:
- receives code submitted by participants
- sends submitted code and fold indices to datarun
- receives predictions from datarun (see the sketch after this list)
- evaluates the score and contributivity of the received predictions (in the case where data are not sensitive)
- shows these scores and contributivity on the leaderboard
- possibly shows code from all participants
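To make this exchange concrete, here is a minimal sketch of how databoard might submit code to datarun and retrieve the predictions. The base URL, endpoint paths, and response fields are assumptions for illustration, not datarun's actual API.

```python
import requests

DATARUN_URL = "http://datarun.example.org/api"  # assumed base URL

# Send the participant's submission code and the CV fold indices to datarun.
with open("classifier.py") as f:
    response = requests.post(
        f"{DATARUN_URL}/runs/",
        json={
            "submission_code": f.read(),
            "train_indices": [0, 1, 2, 5, 8],  # indices of the CV fold
        },
    )
run_id = response.json()["id"]  # assumed response field

# Later, poll datarun for the predictions and the monitoring metrics.
result = requests.get(f"{DATARUN_URL}/runs/{run_id}/").json()
predictions = result["predictions"]
print(result["cpu_time"], result["max_memory"])
```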
The idea is to have one webservice gathering all RAMPs (for now, one webservice is deployed per RAMP), which also means a unified database for all RAMPs.
Datarun:
- receives data from databoard (in the case where data are not sensitive)
- splits the dataset into train and test sets
- receives the submission code and the CV fold indices
- possibly receives constraints on CPU time and memory usage from databoard
- trains the submitted code on the CV fold indices and computes predictions on the test dataset (a sketch follows this section)
- monitors the efficiency of the submitted code (CPU time and memory consumption)
- sends the predictions and monitoring metrics to databoard
The idea is to have a service that trains and tests models on folds (maybe in docker containers).
When data are sensitive, this API can be deployed on the data owner's servers.
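A minimal sketch of the core train/test step, assuming a scikit-learn-style model as a stand-in for the submitted code; all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data received from databoard.
X, y = np.load("X.npy"), np.load("y.npy")

# Split the dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def train_and_predict(train_is):
    """Train on the given CV fold indices, predict on the test set."""
    model = LogisticRegression()  # stands in for the submitted code
    model.fit(X_train[train_is], y_train[train_is])
    return model.predict_proba(X_test)
```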
Databoard (admin side):
- receives input from the admin on the problem setup: workflow elements, score, target column, train/test cut, CV, possibly a data link
- receives input from the admin on the RAMP setup: who can participate, opening/closing/public opening dates, frequency of submissions, etc. (a sketch of such a setup follows this list)
- receives a set of servers with resource limits (number of CPUs/GPUs, type of machines, memory), together with an ssh key
- sends setup information to Datarun: data, etc.
- alternatively, if data is sensitive, the admin sets up the data on the servers using a limited databoard client/script
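For illustration, the setup information an admin provides might look like the following; every key and value here is hypothetical.

```python
ramp_setup = {
    # problem setup
    "workflow_elements": ["feature_extractor", "classifier"],
    "score": "accuracy",
    "target_column": "species",
    "test_size": 0.2,  # train/test cut
    "cv": {"n_folds": 8, "random_state": 42},
    "data_link": "https://example.org/data.csv",  # omitted if data is sensitive
    # RAMP setup
    "participants": "public",
    "opening_date": "2016-03-01",
    "closing_date": "2016-04-30",
    "min_time_between_submissions": "15 min",
    # servers made available to datarun
    "servers": [{"host": "runner1.example.org", "cpus": 8, "gpus": 0,
                 "memory_gb": 32, "ssh_key": "~/.ssh/datarun_rsa.pub"}],
}
```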
Spec:
- tasks (training and testing a model on a CV fold) managed with celery (a task sketch follows this list)
- one master and several workers on remote machines (the "datarunners")
- train and test not run in docker containers
- monitoring of train and test (CPU time, memory usage) using the Python resource package
- monitoring of the datarunners using flower
- deployment scripts on StratusLab
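A minimal sketch of such a celery task, reporting CPU time and peak memory with the resource module; the broker URL and the task payload are assumptions.

```python
import resource

from celery import Celery

app = Celery("datarun", broker="amqp://localhost")  # assumed broker URL

@app.task
def train_test_fold(submission_id, train_is):
    """Train and test one submission on one CV fold on a datarunner."""
    predictions = train_and_predict(train_is)  # sketched earlier
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "submission_id": submission_id,
        "predictions": predictions.tolist(),
        "cpu_time": usage.ru_utime + usage.ru_stime,  # user + system seconds
        "max_memory": usage.ru_maxrss,  # peak RSS, in kB on Linux
    }
```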
Expected due date: end of March 2016
Open issues:
- security
- easy deployment on different architectures (a difficult task):
  - Reims
  - AWS?
  - using SlipStream?
  - as a single piece of software on one machine
  - ??
- dealing with GPUs
- training and testing models in different languages
It is hard to plan the technology now; some possible implementations:
- the HTCondor job manager
- celery-flower-kubernetes; in this case, we would containerize the tasks using docker. Advantages: opening submissions to other languages, monitoring. Disadvantages: security?
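If the celery-flower-kubernetes option were chosen, each task could run the submitted code in a throwaway docker container, which is what would open the door to other languages. A rough sketch, with an assumed image name and mount layout:

```python
import subprocess

def run_fold_in_container(submission_dir, fold_id):
    """Run one train/test task inside a disposable docker container."""
    subprocess.run(
        ["docker", "run", "--rm",
         "--memory", "4g", "--cpus", "1",        # resource limits
         "-v", submission_dir + ":/submission",  # mount the submitted code
         "datarun-worker",                       # assumed image name
         "python", "/submission/train_test.py", str(fold_id)],
        check=True,
    )
```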