
Implement pre-processing and training on LETOR dataset #128

Open
jfomhover opened this issue Oct 27, 2021 · 0 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), training-benchmark

jfomhover commented Oct 27, 2021

The goal of this task is to reproduce the results reported in the paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree and to test LightGBM on a publicly available, well-known benchmark dataset, to ensure our benchmark's reproducibility. We already have a generic training script for LightGBM, so this task consists of writing a pre-processor for this particular dataset, identifying the right parameters for running LightGBM on it, and running the result in AzureML.
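
On the "right parameters" front: LETOR is a learning-to-rank benchmark, so the natural LightGBM setup is the lambdarank objective evaluated with NDCG. Below is a minimal sketch of what the training call could look like; the parameter values are illustrative placeholders (not the settings needed to match the paper), and train_set / valid_set are assumed to be lgb.Dataset objects already built with query group information.

```python
import lightgbm as lgb

# Illustrative parameters only -- the values needed to match the paper still
# have to be identified as part of this task.
params = {
    "objective": "lambdarank",     # LETOR is a learning-to-rank benchmark
    "metric": "ndcg",
    "ndcg_eval_at": [1, 3, 5, 10],
    "learning_rate": 0.1,
    "num_leaves": 31,
    "verbosity": -1,
}

# train_set / valid_set are assumed to be lgb.Dataset objects carrying the
# per-query group sizes (see the pre-processing sketch further down).
booster = lgb.train(
    params,
    train_set,
    num_boost_round=100,
    valid_sets=[valid_set],
    valid_names=["valid"],
)
```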

The expected impact of this task is to:

  • establish trust in our benchmark by obtaining results comparable with existing reference benchmarks
  • increase the value of this benchmark for the community by providing reproducible results on standard data

Learning Goals

By working on this project, you'll learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to use lightgbm in practice on a sample dataset
  • how to use mlflow and AzureML run history to report metrics
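
On the mlflow point, reporting boils down to a handful of calls; here is a minimal sketch with made-up values (when the script runs as an AzureML job, the AzureML mlflow integration surfaces these metrics in the run history).

```python
import mlflow

# Made-up values for illustration; in the real script these would come from
# the LightGBM training loop and its evaluation results.
with mlflow.start_run():
    mlflow.log_param("objective", "lambdarank")
    mlflow.log_metric("training_time_seconds", 42.0)
    for round_idx, ndcg in enumerate([0.41, 0.44, 0.45], start=1):
        # One point per boosting round shows up as a curve in the run history.
        mlflow.log_metric("valid_ndcg_at_10", ndcg, step=round_idx)
```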

Expected Deliverables

To complete this task, you need to deliver:

  • a working python script to parse the original LETOR dataset to feed into LightGBM (see the sketch after this list)
  • a working AzureML component
  • [stretch] a working pipeline with pre-processing and training, reporting training metrics
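
For the first deliverable, one possible shape of the parsing script (a sketch, assuming the LETOR files use the standard SVMLight-style format with qid: annotations and a trailing # comment per line; the file path is a placeholder):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_svmlight_file


def load_letor_fold(path: str) -> lgb.Dataset:
    """Parse one LETOR fold file into a LightGBM Dataset with query groups."""
    # LETOR rows follow the SVMLight format with a qid field, e.g.:
    #   2 qid:10 1:0.031310 2:0.666667 ... #docid = GX008-86-4444840
    X, y, qid = load_svmlight_file(path, query_id=True)

    # LightGBM rankers need the number of documents per query, in the order
    # the queries appear in the file (LETOR keeps each query's rows contiguous).
    _, first_index, counts = np.unique(qid, return_index=True, return_counts=True)
    group_sizes = counts[np.argsort(first_index)]

    return lgb.Dataset(X, label=y, group=group_sizes)


# Placeholder path -- adjust to wherever the dataset was unzipped under data/.
train_set = load_letor_fold("data/letor/Fold1/train.txt")
```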

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you run into, that will help!
  2. Clone this repo, create your own branch username/letor (or something) for your own work (commit often!).
  3. In src/scripts/ create a folder preprocess_letor/ and copy the content of src/scripts/samples/ in it.
  4. Download the LETOR dataset from the original source, unzip it if necessary and put it in a subfolder under data/ at the root of the repo (git ignored).

Local development

Let's start locally first...

WORK IN PROGRESS

Develop for AzureML

WORK IN PROGRESS
