Roadmap

vfdev edited this page Feb 9, 2022 · 16 revisions

This document lists the general directions that the core team would like to see developed in PyTorch-Ignite.

We use GitHub Projects to define our different goals: releases, particular milestones, etc.

Principal goals

  • continue maintaining high-quality, well-tested, and documented modules
  • provide distributed framework support via ignite.distributed: XLA (e.g. TPU), Horovod
  • provide a new higher-level API based on Engine, as a contrib module, to simplify usage while keeping flexibility
  • provide helpers for data management via ignite.data: sampling, multi-dataloaders
  • provide more integrations with other tools to simplify end-to-end Machine/Deep Learning applications
  • improve visibility and communications

Codebase maintenance

  • add typing to the whole package
  • adapt the code and add a mypy check
  • merge the contrib module into the main library?

Pre-built Docker images

Distributed framework support

Metrics

  • All metrics work in distributed configurations
    • configurable reduce/gather methods for distributed metrics
  • Minor improvements:
    • better support for sklearn metrics
    • Classification metrics with micro/macro options
  • Metrics for NLP: ROUGE, BLEU, METEOR, PPL
  • Metrics for GANs: FID, PPL (#998)

See also the related GSoC 2021 project idea description.

Higher-level API

  • push-button contrib trainers with AMP, distributed support, etc.
  • automatic batch size via toma

See also the related GSoC 2021 project idea description.

Refactor Engine

The Engine in version 0.4.x contains several major bugs related to the way event triggering and counting are implemented. In particular, event filtering requires the state and its corresponding attributes to be available, which is not a clean design. Solving the open issues labeled "module: engine" (https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+engine%22) requires a major Engine redesign while keeping backward compatibility as much as possible.

Engine derived from EventsDriven

The idea is to split Engine(Serializable) into Engine(Serializable, EventsDriven), where EventsDriven is a class responsible for event registration, triggering, etc. Engine would then contain only the logic to register the necessary events and to run its two loops (over epochs and over iterations).
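A minimal sketch of how this split could look, written in plain Python without importing ignite. Everything here is an assumption made for illustration: the method names (register_events, add_event_handler, fire_event), the string event names, and the SketchEngine class are hypothetical, and Serializable is left out to keep the example self-contained. It is not the planned implementation.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


class EventsDriven:
    """Hypothetical base class: owns event registration, firing and counting."""

    def __init__(self) -> None:
        self._allowed_events: List[str] = []
        self._handlers: Dict[str, List[Callable[..., Any]]] = defaultdict(list)
        self._counters: Dict[str, int] = defaultdict(int)

    def register_events(self, *names: str) -> None:
        self._allowed_events.extend(names)

    def add_event_handler(self, name: str, handler: Callable[..., Any]) -> None:
        if name not in self._allowed_events:
            raise ValueError(f"Unknown event: {name}")
        self._handlers[name].append(handler)

    def fire_event(self, name: str) -> None:
        # Counting lives here, so handlers and filters do not need engine state.
        self._counters[name] += 1
        for handler in self._handlers[name]:
            handler(self)


class SketchEngine(EventsDriven):
    """Hypothetical engine: only registers its events and runs the two loops."""

    def __init__(self, process_function: Callable[[Any, Any], Any]) -> None:
        super().__init__()
        self.process_function = process_function
        self.register_events("EPOCH_STARTED", "ITERATION_COMPLETED", "EPOCH_COMPLETED")

    def run(self, data: Iterable[Any], max_epochs: int = 1) -> None:
        for _ in range(max_epochs):  # outer loop: epochs
            self.fire_event("EPOCH_STARTED")
            for batch in data:  # inner loop: iterations
                self.process_function(self, batch)
                self.fire_event("ITERATION_COMPLETED")
            self.fire_event("EPOCH_COMPLETED")
```

In this sketch the event counters belong to EventsDriven, so a filter such as "every N-th iteration" could be evaluated from the counter alone.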

Engine.run_one_epoch as a public method

Exposing run_one_epoch publicly would help users combine their custom outer loops with the Engine's own loop. Required here:

The tricky part is resuming from the stopped iteration when the epoch length differs from the data size or when the data is an iterable.
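The difficulty can be illustrated with a toy engine, sketched here in plain Python under stated assumptions: EpochEngine, its attributes, and the way it keeps one data iterator alive between calls are all hypothetical and do not reflect ignite's Engine. The point is only that, when epoch_length is not the data size, each epoch has to continue from where the previous one stopped.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


class EpochEngine:
    """Hypothetical toy engine exposing run_one_epoch as a public method."""

    def __init__(
        self,
        process_function: Callable[[Any, Any], Any],
        data: Iterable[Any],
        epoch_length: int,
    ) -> None:
        self.process_function = process_function
        self.data = data  # assumed non-empty and re-iterable (e.g. a list)
        self.epoch_length = epoch_length
        self.epoch = 0
        self.iteration = 0
        self._data_iter: Optional[Iterator[Any]] = None

    def run_one_epoch(self) -> None:
        # Keep the iterator between calls so the next epoch resumes
        # from the batch where the previous one stopped.
        if self._data_iter is None:
            self._data_iter = iter(self.data)
        for _ in range(self.epoch_length):
            try:
                batch = next(self._data_iter)
            except StopIteration:
                # Data exhausted in the middle of an epoch: start over.
                self._data_iter = iter(self.data)
                batch = next(self._data_iter)
            self.process_function(self, batch)
            self.iteration += 1
        self.epoch += 1


# Custom outer loop driven by the user, with epoch_length != len(data).
seen = []
engine = EpochEngine(lambda e, batch: seen.append(batch), data=[0, 1, 2, 3, 4], epoch_length=3)
for _ in range(2):
    engine.run_one_epoch()
    # user code (validation, custom scheduling, ...) could go here
```

A one-shot iterable such as a generator cannot be restarted this way, which is part of what makes the real case tricky.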

Run/Resume logic improvements

Currently, the engine's behavior is somewhat unclear about when it restarts from the beginning and when it continues.

Currently

# (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# error
engine.run(data, max_epochs=4) -> ValueError: Argument max_epochs should be larger than the start epoch

# restart from 0 to 7 (state.epoch == max_epochs == 7, so the run restarts; this matches what we always do with evaluator.run(data), called without any other instructions)
engine.run(data, max_epochs=7) -> Engine run starting with max_epochs=7 => state.epoch=7

# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# forced restart from 0 to 9, instead of continuing from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

The proposal is to change this slightly in two places: the "error" case and the ugly engine.state.max_epochs = None workaround.

Proposed API

# SAME. (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# SAME. continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# As max_epochs=4 <= state.epoch=7 => restart
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# restart from 0 to 4
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# Now (not forced) restart from 0 to 3 (as max_epochs=3 <= state.epoch=4 => restart)
engine.run(data, max_epochs=3) -> Engine run starting with max_epochs=3 => state.epoch=3

# SOMETHING TO CHANGE HERE. Forced restart from 0 to 9, instead of continuing from state.epoch=3
engine.state.max_epochs = None  # maybe, engine.reset() -> state.epoch=state.iteration=0,state.max_epochs=state.max_iters=None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

# In case of max_iters, we'll have to do:
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
engine.state.max_iters = None
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
# So there is no uniform API to restart engine...
  • Fix issue #1521
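The proposed rule for max_epochs can be summarized as a small decision function. This is a sketch under assumptions, not ignite code: ToyState, plan_run, and reset are hypothetical names, and plan_run simply pretends that each run completes. It encodes only what the listing above states: start when there is no previous run, restart when max_epochs <= state.epoch, and resume otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyState:
    """Hypothetical stand-in for the engine state (epoch-based fields only)."""

    epoch: int = 0
    max_epochs: Optional[int] = None


def plan_run(state: ToyState, max_epochs: int) -> str:
    """Return "start" or "resume" under the proposed rule, then mark the run as completed."""
    if state.max_epochs is None or max_epochs <= state.epoch:
        # No previous run, or the requested end is not beyond the current epoch:
        # (re)start from 0 instead of raising ValueError.
        state.epoch = 0
        action = "start"
    else:
        action = "resume"
    state.max_epochs = max_epochs
    state.epoch = max_epochs  # assumption: the run finishes normally
    return action


def reset(state: ToyState) -> None:
    """Hypothetical counterpart of the engine.reset() idea mentioned above."""
    state.epoch = 0
    state.max_epochs = None


state = ToyState()
actions = [plan_run(state, n) for n in (5, 7, 4, 4, 3)]
reset(state)  # forced restart instead of continuing from state.epoch=3
actions.append(plan_run(state, 9))
```

A single reset helper of this kind would cover both max_epochs and max_iters if it cleared both, which is the uniformity the listing above finds missing.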

Pipeline Parallelism support

Helper on data management

  • better and simpler coverage of multi-dataloader use cases, e.g. GAN, SSL, etc.

Integrations

  • Verify compatibility (i.e. that ignite is not a blocker) when writing applications for Federated Learning
  • Verify compatibility (i.e. that ignite is not a blocker) when writing applications with the Distributed RPC framework

Communications

  • More applications and success stories with PyTorch-Ignite
  • Showcase via the ClearML Ignite server:
    • more experiments with Ignite from our users