
Implement pre-processing and training on LETOR dataset #128

Open
jfomhover opened this issue Oct 27, 2021 · 0 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), training-benchmark

jfomhover commented Oct 27, 2021

The goal of this task is to reproduce the results reported in the paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree and to test LightGBM on a publicly available, well-known benchmark dataset, to ensure our benchmark's reproducibility. We already have a generic training script for LightGBM, so this task consists of writing a pre-processor for this particular dataset, identifying the right parameters for running LightGBM on it, and running the result in AzureML.
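
On the "right parameters" front: LETOR is a learning-to-rank benchmark, so the natural LightGBM setup is the lambdarank objective evaluated with NDCG. Below is a minimal sketch of what the training call could look like; the parameter values are illustrative placeholders (not the settings needed to match the paper), and train_set / valid_set are assumed to be lgb.Dataset objects already built with query group information.

```python
import lightgbm as lgb

# Illustrative parameters only -- the values needed to match the paper still
# have to be identified as part of this task.
params = {
    "objective": "lambdarank",     # LETOR is a learning-to-rank benchmark
    "metric": "ndcg",
    "ndcg_eval_at": [1, 3, 5, 10],
    "learning_rate": 0.1,
    "num_leaves": 31,
    "verbosity": -1,
}

# train_set / valid_set are assumed to be lgb.Dataset objects carrying the
# per-query group sizes (see the pre-processing sketch further down).
booster = lgb.train(
    params,
    train_set,
    num_boost_round=100,
    valid_sets=[valid_set],
    valid_names=["valid"],
)
```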

The expected impact of this task is to:

  • establish trust in our benchmark by obtaining results comparable with existing reference benchmarks
  • increase the value of this benchmark for the community by providing reproducible results on standard data

Learning Goals

By working on this project, you'll learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to use lightgbm in practice on a sample dataset
  • how to use mlflow and AzureML run history to report metrics
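
On the mlflow point, reporting boils down to a handful of calls; here is a minimal sketch with made-up values (when the script runs as an AzureML job, the AzureML mlflow integration surfaces these metrics in the run history).

```python
import mlflow

# Made-up values for illustration; in the real script these would come from
# the LightGBM training loop and its evaluation results.
with mlflow.start_run():
    mlflow.log_param("objective", "lambdarank")
    mlflow.log_metric("training_time_seconds", 42.0)
    for round_idx, ndcg in enumerate([0.41, 0.44, 0.45], start=1):
        # One point per boosting round shows up as a curve in the run history.
        mlflow.log_metric("valid_ndcg_at_10", ndcg, step=round_idx)
```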

Expected Deliverables

To complete this task, you need to deliver:

  • a working python script to parse the original LETOR dataset to feed into LightGBM (see the sketch after this list)
  • a working AzureML component
  • [stretch] a working pipeline with pre-processing and training, reporting training metrics
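
For the first deliverable, one possible shape of the parsing script (a sketch, assuming the LETOR files use the standard SVMLight-style format with qid: annotations and a trailing # comment per line; the file path is a placeholder):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_svmlight_file


def load_letor_fold(path: str) -> lgb.Dataset:
    """Parse one LETOR fold file into a LightGBM Dataset with query groups."""
    # LETOR rows follow the SVMLight format with a qid field, e.g.:
    #   2 qid:10 1:0.031310 2:0.666667 ... #docid = GX008-86-4444840
    X, y, qid = load_svmlight_file(path, query_id=True)

    # LightGBM rankers need the number of documents per query, in the order
    # the queries appear in the file (LETOR keeps each query's rows contiguous).
    _, first_index, counts = np.unique(qid, return_index=True, return_counts=True)
    group_sizes = counts[np.argsort(first_index)]

    return lgb.Dataset(X, label=y, group=group_sizes)


# Placeholder path -- adjust to wherever the dataset was unzipped under data/.
train_set = load_letor_fold("data/letor/Fold1/train.txt")
```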

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you run into, that will help!
  2. Clone this repo, create your own branch username/letor (or something) for your own work (commit often!).
  3. In src/scripts/ create a folder preprocess_letor/ and copy the content of src/scripts/samples/ in it.
  4. Download the LETOR dataset from the original source, unzip it if necessary and put it in a subfolder under data/ at the root of the repo (git ignored).

Local development

Let's start locally first...

WORK IN PROGRESS

Develop for AzureML

WORK IN PROGRESS
