Roadmap

vfdev edited this page Apr 19, 2021 · 16 revisions

This document lists the general directions that the core team would like to see developed in PyTorch-Ignite.

We use GitHub Projects to track our goals: releases, particular milestones, etc.

Principal goals

  • continue maintaining high-quality, well-tested and documented modules
  • provide distributed framework support via ignite.distributed: XLA (e.g. TPU), Horovod
  • provide a new higher-level API based on Engine, as a contrib module, to simplify usage while keeping flexibility
  • provide helpers for data management via ignite.data: sampling, multi-dataloaders
  • provide more integrations with other tools to simplify end-to-end Machine/Deep Learning applications
  • improve visibility and communications

Codebase maintenance

  • add typing to the whole package
  • adapt the code and add mypy check
  • merge the contrib module into the main library?

Pre-built Docker images

Distributed framework support

Metrics

  • All metrics work in distributed
    • configurable distributed metrics reduce/gather methods
  • Minor improvements:
    • better support of sklearn metrics
    • Classification metrics with micro/macro options
  • Metrics for NLP: ROUGE, BLEU, METEOR, PPL
  • Metrics for GANs: FID, PPL (#998)
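
As a rough illustration of what a configurable reduce method could look like, the sketch below computes accuracy from per-worker counts and lets the caller swap the reduction function. It is plain Python and not Ignite's API: `reduce_sum`, `local_counts`, `accuracy` and the simulated list of workers are hypothetical names used only for this example.

```python
from typing import Callable, List, Sequence, Tuple

Counts = Tuple[int, int]  # (num_correct, num_examples)


def reduce_sum(per_worker: Sequence[Counts]) -> Counts:
    """Hypothetical reduce method: element-wise sum over workers."""
    correct = sum(c for c, _ in per_worker)
    total = sum(n for _, n in per_worker)
    return correct, total


def local_counts(preds: List[int], targets: List[int]) -> Counts:
    """Counts accumulated by one worker on its own shard of the data."""
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return correct, len(targets)


def accuracy(
    per_worker: Sequence[Counts],
    reduce_fn: Callable[[Sequence[Counts]], Counts] = reduce_sum,
) -> float:
    """Reduce the per-worker counts first, then compute the metric once."""
    correct, total = reduce_fn(per_worker)
    if total == 0:
        raise ValueError("accuracy needs at least one example")
    return correct / total


# Two simulated workers with shards of different sizes.
worker_0 = local_counts([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct
worker_1 = local_counts([0, 0], [1, 0])              # 1 of 2 correct
global_accuracy = accuracy([worker_0, worker_1])     # 4 of 6 correct
```

Reducing the counts rather than averaging per-worker accuracies keeps the result correct when shards have unequal sizes: here the average of 0.75 and 0.5 would be 0.625, while the true accuracy is 4/6.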

See also related GSoC 2021 project idea description

Higher-level API

  • push-button contrib trainers with AMP, distributed etc
  • automatic batch size via toma
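
The idea behind an automatic batch size can be sketched as a retry loop that halves the batch until a step fits in memory. The code below is a simplified stand-in written for this page, not toma's API: `run_with_adaptive_batch_size` and `fake_step` are hypothetical, and `MemoryError` stands in for the out-of-memory error that a real GPU training step would raise.

```python
from typing import Callable


def run_with_adaptive_batch_size(
    step: Callable[[int], None], initial_batch_size: int
) -> int:
    """Halve the batch size until `step` stops raising MemoryError.

    Returns the first batch size that worked.
    """
    batch_size = initial_batch_size
    while batch_size >= 1:
        try:
            step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # retry with half the batch
    raise MemoryError("step failed even with batch_size=1")


def fake_step(batch_size: int) -> None:
    """Stand-in for a training step: pretend anything above 48 does not fit."""
    if batch_size > 48:
        raise MemoryError(f"batch_size={batch_size} does not fit")


# Tries 256, 128, 64 (all fail), then succeeds with 32.
found = run_with_adaptive_batch_size(fake_step, initial_batch_size=256)
```

A push-button trainer could run such a search once before training and reuse the batch size it finds, possibly combined with gradient accumulation to keep the effective batch size constant.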

See also related GSoC 2021 project idea description

Refactor Engine

The Engine in the 0.4.x versions contains several major bugs related to the way event triggering and counting are implemented. In particular, event filtering requires the state and its corresponding attributes to be available, which is not a good design. Solving the open issues labelled "module: engine" (https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+engine%22) requires a major Engine redesign while keeping as much backward compatibility as possible.
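
To make this coupling between filters and state concrete, here is a toy model. It is not Ignite's implementation: `ToyState`, `every` and `run` are hypothetical names, and the only point is that a filter of this kind cannot decide anything without reading a counter stored on the state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyState:
    iteration: int = 0


def every(n: int) -> Callable[[ToyState], bool]:
    """Toy event filter: it can only decide by reading a counter on the state."""
    return lambda state: state.iteration % n == 0


def run(
    num_iterations: int,
    state: ToyState,
    event_filter: Callable[[ToyState], bool],
) -> List[int]:
    """Advance the counter and record the iterations where the filter passes."""
    fired = []
    for _ in range(num_iterations):
        state.iteration += 1
        if event_filter(state):
            fired.append(state.iteration)
    return fired


state = ToyState()
fired_first = run(6, state, every(3))   # counter goes 1..6, fires at 3 and 6
fired_second = run(6, state, every(3))  # counter continues 7..12, fires at 9 and 12
```

In this model, whether the second run fires at 9 and 12 or again at 3 and 6 depends entirely on whether the state counter was reset in between, which is why event counting and the run/resume logic have to be designed together.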

Run/Resume logic improvements

Currently, the engine's behaviour is somewhat unclear about when it restarts from the beginning and when it continues.

Current behaviour

# (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# error
engine.run(data, max_epochs=4) -> ValueError: Argument max_epochs should be larger than the start epoch

# restart from 0 to 7: since state.epoch == max_epochs (= 7), the run restarts, which matches how we always call evaluator.run(data) repeatedly without any other instructions
engine.run(data, max_epochs=7) -> Engine run starting with max_epochs=7 => state.epoch=7

# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# forced restart from 0 to 9, instead of continue from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

The proposal is to change this slightly, addressing the "error" case and the ugly engine.state.max_epochs = None workaround.

Proposed API

# SAME. (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# SAME. continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# As max_epochs=4 <= state.epoch=7 => restart
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# restart from 0 to 4
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# Now (not forced) restart from 0 to 3 (as max_epochs=3 <= state.epoch=4 => restart)
engine.run(data, max_epochs=3) -> Engine run starting with max_epochs=3 => state.epoch=3

# SOMETHING TO CHANGE HERE. Forced restart from 0 to 9, instead of continue from state.epoch=3
engine.state.max_epochs = None  # maybe, engine.reset() -> state.epoch=state.iteration=0,state.max_epochs=state.max_iters=None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

# In case of max_iters, we'll have to do:
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
engine.state.max_iters = None
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
# So there is no uniform API to restart engine...
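
The proposed rule for max_epochs can be written down as a small decision function. This is a minimal sketch under the assumptions listed above, not the Engine's code: `ToyState`, `plan_run` and `reset` are hypothetical, `reset` mirrors the "maybe, engine.reset()" idea, and the max_iters case is left out because its behaviour is still open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyState:
    epoch: int = 0
    max_epochs: Optional[int] = None


def plan_run(state: ToyState, max_epochs: int) -> str:
    """Decide between starting and resuming, then pretend the run completed."""
    if max_epochs < 1:
        raise ValueError("max_epochs should be positive")
    if state.max_epochs is not None and max_epochs > state.epoch:
        action = f"resume from epoch {state.epoch} until {max_epochs}"
    else:
        # Fresh or reset state, or max_epochs <= state.epoch: start over.
        action = f"start from epoch 0 until {max_epochs}"
    state.epoch = max_epochs
    state.max_epochs = max_epochs
    return action


def reset(state: ToyState) -> None:
    """Hypothetical engine.reset(): clear counters so the next run starts over."""
    state.epoch = 0
    state.max_epochs = None


state = ToyState()
log = [plan_run(state, n) for n in (5, 7, 4, 4, 3)]
reset(state)  # forced restart instead of continuing from epoch 3
log.append(plan_run(state, 9))
```

Replaying the sequence from the proposal gives start, resume, start, start, start and, after the reset, start again. Only the second call resumes, because it is the only one where max_epochs is larger than the current epoch on a state that has already run.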

Helper on data management

  • better and simpler coverage of multi-dataloader use cases, e.g. GAN, SSL, etc.
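
One common multi-dataloader pattern, for example pairing a large unlabeled set with a small labeled set in SSL, is to iterate over the longer source once and restart the shorter one as needed. The sketch below shows that pattern with plain Python iterables; `zip_cycling_shorter` is a hypothetical helper and not part of ignite.data.

```python
from itertools import cycle
from typing import Iterable, Iterator, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def zip_cycling_shorter(
    primary: Iterable[A], secondary: Iterable[B]
) -> Iterator[Tuple[A, B]]:
    """Yield one pair per item of `primary`, repeating `secondary` when it runs out."""
    return zip(primary, cycle(secondary))


# Lists stand in for data loaders: five unlabeled batches, two labeled batches.
unlabeled = ["u0", "u1", "u2", "u3", "u4"]
labeled = ["l0", "l1"]
pairs = list(zip_cycling_shorter(unlabeled, labeled))
```

Note that `itertools.cycle` replays cached items, so with a real shuffling data loader the shorter source would repeat the same batches in the same order and keep them in memory. A dedicated helper would more likely re-create the iterator at each pass.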

Integrations

  • Verify compatibility (i.e. that Ignite is not a blocker) when writing applications for Federated Learning
  • Verify compatibility (i.e. that Ignite is not a blocker) when writing applications with the Distributed RPC framework

Communications

  • More applications and success stories with PyTorch-Ignite
  • Showcase via ClearML Ignite server:
    • more experiments with Ignite from our users