Project description and milestones
Datarun's goal is to train and test machine learning models. It is a REST API written in Django.
It is designed to work with databoard.
Below are the main features of databoard and datarun, since the two are designed to work together for now.
Databoard:
- receives code submitted by participants
- sends submitted code and fold indices to datarun
- receives predictions from datarun (see the sketch after this list)
- evaluates the score and contributivity of the received predictions (in the case where data are not sensitive)
- shows these scores and contributivity on the leaderboard
- possibly shows code from all participants
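To make this exchange concrete, here is a minimal sketch of how databoard might submit code to datarun and retrieve the predictions. The base URL, endpoint paths, and response fields are assumptions for illustration, not datarun's actual API.

```python
import requests

DATARUN_URL = "http://datarun.example.org/api"  # assumed base URL

# Send the participant's submission code and the CV fold indices to datarun.
with open("classifier.py") as f:
    response = requests.post(
        f"{DATARUN_URL}/runs/",
        json={
            "submission_code": f.read(),
            "train_indices": [0, 1, 2, 5, 8],  # indices of the CV fold
        },
    )
run_id = response.json()["id"]  # assumed response field

# Later, poll datarun for the predictions and the monitoring metrics.
result = requests.get(f"{DATARUN_URL}/runs/{run_id}/").json()
predictions = result["predictions"]
print(result["cpu_time"], result["max_memory"])
```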
The idea is to have one webservice gathering all RAMPs (for now, one webservice is deployed per RAMP), which also means a unified database for all RAMPs.
Datarun:
- receives data from databoard (in the case where data are not sensitive)
- splits the dataset into train and test sets
- receives the submission code and the CV fold indices
- possibly receives constraints on CPU time and memory usage from databoard
- trains the submitted code on the CV fold indices and computes predictions on the test dataset (a sketch follows this section)
- monitors the efficiency of the submitted code (CPU time and memory consumption)
- sends the predictions and monitoring metrics to databoard
The idea is to have a service that trains and tests models on folds (maybe in docker containers).
When data are sensitive, this API can be deployed on the data owner's servers.
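A minimal sketch of the core train/test step, assuming a scikit-learn-style model as a stand-in for the submitted code; all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data received from databoard.
X, y = np.load("X.npy"), np.load("y.npy")

# Split the dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def train_and_predict(train_is):
    """Train on the given CV fold indices, predict on the test set."""
    model = LogisticRegression()  # stands in for the submitted code
    model.fit(X_train[train_is], y_train[train_is])
    return model.predict_proba(X_test)
```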
Databoard (admin side):
- receives input from the admin on the problem setup: workflow elements, score, target column, train/test cut, CV, possibly a data link
- receives input from the admin on the RAMP setup: who can participate, opening/closing/public opening dates, frequency of submissions, etc. (a sketch of such a setup follows this list)
- receives a set of servers with resource limits (number of CPUs/GPUs, type of machines, memory), together with an ssh key
- sends setup information to Datarun: data, etc.
- alternatively, if data is sensitive, the admin sets up the data on the servers using a limited databoard client/script
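For illustration, the setup information an admin provides might look like the following; every key and value here is hypothetical.

```python
ramp_setup = {
    # problem setup
    "workflow_elements": ["feature_extractor", "classifier"],
    "score": "accuracy",
    "target_column": "species",
    "test_size": 0.2,  # train/test cut
    "cv": {"n_folds": 8, "random_state": 42},
    "data_link": "https://example.org/data.csv",  # omitted if data is sensitive
    # RAMP setup
    "participants": "public",
    "opening_date": "2016-03-01",
    "closing_date": "2016-04-30",
    "min_time_between_submissions": "15 min",
    # servers made available to datarun
    "servers": [{"host": "runner1.example.org", "cpus": 8, "gpus": 0,
                 "memory_gb": 32, "ssh_key": "~/.ssh/datarun_rsa.pub"}],
}
```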
Spec:
- tasks (training and testing a model on a CV fold) managed with celery (a task sketch follows this list)
- one master and several workers on remote machines (the "datarunners")
- train and test not run in docker containers
- monitoring of train and test (CPU time, memory usage) using the Python resource package
- monitoring of the datarunners using flower
- deployment scripts on StratusLab
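A minimal sketch of such a celery task, reporting CPU time and peak memory with the resource module; the broker URL and the task payload are assumptions.

```python
import resource

from celery import Celery

app = Celery("datarun", broker="amqp://localhost")  # assumed broker URL

@app.task
def train_test_fold(submission_id, train_is):
    """Train and test one submission on one CV fold on a datarunner."""
    predictions = train_and_predict(train_is)  # sketched earlier
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "submission_id": submission_id,
        "predictions": predictions.tolist(),
        "cpu_time": usage.ru_utime + usage.ru_stime,  # user + system seconds
        "max_memory": usage.ru_maxrss,  # peak RSS, in kB on Linux
    }
```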
Expected due date: end of March 2016
Open issues:
- security
- easy deployment on different architectures (a difficult task):
  - Reims
  - AWS?
  - using SlipStream?
  - as a single piece of software on one machine
  - ??
- dealing with GPUs
- training and testing models in different languages
It is hard to plan the technology now; some possible implementations:
- the HTCondor job manager
- celery-flower-kubernetes; in this case, we would containerize the tasks using docker. Advantages: opening submissions to other languages, monitoring. Disadvantages: security?
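If the celery-flower-kubernetes option were chosen, each task could run the submitted code in a throwaway docker container, which is what would open the door to other languages. A rough sketch, with an assumed image name and mount layout:

```python
import subprocess

def run_fold_in_container(submission_dir, fold_id):
    """Run one train/test task inside a disposable docker container."""
    subprocess.run(
        ["docker", "run", "--rm",
         "--memory", "4g", "--cpus", "1",        # resource limits
         "-v", submission_dir + ":/submission",  # mount the submitted code
         "datarun-worker",                       # assumed image name
         "python", "/submission/train_test.py", str(fold_id)],
        check=True,
    )
```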