Project description and milestones

camillemarini edited this page Mar 17, 2016 · 6 revisions

Project description

Datarun's goal is to train and test machine learning models. It is a REST API written in Django.
It is designed to work with databoard. Below are the main features of datarun and databoard, since the two are designed to work together for now.

Databoard: Main interface for participants.

  • receives code submitted by participants
  • sends submitted code and fold indices to datarun
  • receives predictions from datarun
  • evaluates score and contributivity of the received predictions (when the data are not sensitive)
  • shows these scores and contributivity on the leaderboard
  • possibly shows code from all participants

The idea is to have one webservice gathering all RAMPs (for now, one webservice is deployed for each RAMP). It also means a unified database for all RAMPs.
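The hand-off from databoard to datarun described above could be sketched as follows. This is a hypothetical illustration only: the endpoint URL and field names are assumptions, not datarun's actual REST schema.

```python
import json

# Hypothetical datarun endpoint (illustrative assumption, not the real URL).
DATARUN_SUBMIT_URL = "https://datarun.example.org/runs/"

def build_submission_payload(submission_id, code_files, train_is):
    """Bundle a participant's code and the CV fold train indices as JSON."""
    return json.dumps({
        "submission_id": submission_id,
        # map of filename -> source code, as submitted by the participant
        "code": code_files,
        # train indices of the CV fold that datarun should fit on
        "train_is": train_is,
    })

payload = build_submission_payload(
    42, {"classifier.py": "class Classifier: ..."}, [0, 1, 2, 5, 8]
)
```

Databoard would then POST this payload to datarun (e.g. with the requests library) and later receive the resulting predictions back.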

Datarun: API to evaluate models submitted using databoard.

  • receives data from databoard (when the data are not sensitive)
  • splits dataset into train and test sets
  • receives submission code and the CV fold indices
  • possibly receives constraints on CPU time and memory usage from databoard
  • trains the submitted code on the CV fold indices and computes predictions on the test dataset
  • monitors efficiency of submitted code (CPU time and memory consumption)
  • sends predictions and monitoring metrics to databoard

The idea is to have a service training and testing models on folds (maybe in docker containers).
In the case of sensitive data, this API can be deployed on servers of the data owner.
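The core training step above (fit on the fold's train indices, predict on the held-out test set) can be sketched in plain Python. The `Classifier` class here is a trivial stand-in for participant-submitted code; it is an assumption for illustration, not datarun's real model interface.

```python
# Minimal sketch of datarun's train/test step on one CV fold.

class Classifier:
    """Trivial stand-in for a submitted model: predicts the majority class."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)

    def predict(self, X):
        return [self.majority_ for _ in X]

def train_test_on_fold(X_train_full, y_train_full, X_test, train_is):
    """Train on the fold's train indices, then predict on the test set."""
    X_fold = [X_train_full[i] for i in train_is]
    y_fold = [y_train_full[i] for i in train_is]
    clf = Classifier()
    clf.fit(X_fold, y_fold)
    return clf.predict(X_test)

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 0]
preds = train_test_on_fold(X, y, [[4], [5]], train_is=[0, 1, 3])
# preds == [0, 0] (majority class of the fold)
```

In the real service the predictions, together with monitoring metrics, would be sent back to databoard rather than returned locally.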

Databoard admin: Main interface for RAMP admins.

  • receives input from admin on the problem setup: workflow elements, score, target column, train/test cut, CV, possibly data link
  • receives input from admin on the RAMP setup: who can participate, opening/closing/public opening dates, frequency of submissions, etc.
  • receives a set of servers with resource limits (number of CPUs/GPUs, type of machines, memory) with ssh key
  • sends setup information to Datarun: data, etc.
  • alternatively, if data is sensitive, the admin sets up the data on the servers using a limited databoard client/script

Datarun Milestones

v0.1: celery and stratuslab version

Spec:

  • tasks (train and test of a model on a cv fold) managed with celery
  • one master and several workers on remote machines (the "datarunners")
  • train and test not run in docker containers
  • monitoring of train and test (CPU time, memory usage) using the Python resource package
  • monitoring of the datarunners using flower
  • deployment scripts on stratuslab
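The monitoring point in the spec above could look like the sketch below, using the stdlib resource package. In the real v0.1 setup this body would run inside a celery task on a datarunner; the celery wiring and the actual metric names are omitted as assumptions.

```python
import resource

def run_with_monitoring(func, *args, **kwargs):
    """Run func and report its CPU time and the process's peak memory."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = func(*args, **kwargs)
    after = resource.getrusage(resource.RUSAGE_SELF)
    metrics = {
        # user + system CPU seconds consumed by the call
        "cpu_time": (after.ru_utime + after.ru_stime)
                    - (before.ru_utime + before.ru_stime),
        # peak resident set size (kilobytes on Linux, bytes on macOS)
        "max_rss": after.ru_maxrss,
    }
    return result, metrics

result, metrics = run_with_monitoring(sum, range(1_000_000))
```

The resource package only measures the current process, so in a celery deployment the measurement has to happen inside the worker task itself, with the metrics sent back alongside the predictions.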

Expected due date: end of March 2016

For later versions

Problem to address:

  • security

Other features:

  • easy deployment on different architectures (a difficult task):
    • Reims
    • AWS?
    • using SlipStream?
    • as standalone software on a single machine
    • ??
  • dealing with GPUs
  • training and testing models in different languages

Improving the Technology

It is hard to plan the technology now; some possible implementations:

  • job manager HTCondor
  • celery-flower-kubernetes. In this case, we would containerize the tasks using docker. Advantages: opening submissions to other languages, monitoring. Disadvantages: security?