
PruningCallback doesn't work #20

Closed
himkt opened this issue Nov 14, 2020 · 6 comments

himkt (Owner) commented Nov 14, 2020

Split off from #18; see #18 (comment).

himkt (Owner, Author) commented Nov 14, 2020

@vikigenius

Thank you so much for diving into allennlp-optuna.

Which storage backend do you use? (It should be one of SQLite3, MySQL, PostgreSQL, or Redis.)
And could you please share a simple, reproducible configuration with me?
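
For reference, these backends correspond to Optuna storage URLs. A minimal sketch in plain Optuna (not the allennlp-optuna command line), with illustrative connection strings:

import optuna

# Each backend is selected by passing a storage URL when creating the study.
# The study name and URLs below are placeholders; Redis may additionally
# require the redis client package (or optuna.storages.RedisStorage explicitly).
study = optuna.create_study(
    study_name="allennlp_optuna_demo",  # hypothetical study name
    direction="maximize",
    storage="sqlite:///optuna.db",  # SQLite3, the default file-based backend
    # storage="mysql://user:pass@localhost/optuna",       # MySQL
    # storage="postgresql://user:pass@localhost/optuna",  # PostgreSQL
    # storage="redis://localhost:6379",                   # Redis
    load_if_exists=True,
)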

vikigenius commented

Thanks for creating the issue, @himkt. I use the default sqlite3 storage.

local model_name = "models/distilroberta-base-msmarco-v1/0_Transformer";
local num_gpus = 8;
local data_base_url = "data/mydata/processed/";
local batch_size = std.parseInt(std.extVar('batch_size'));
local lr = std.parseJson(std.extVar('lr'));
local model = "my_model";
local dataset_reader = "my_reader";

{
  "train_data_path": data_base_url + "train.tsv.part*",
  "validation_data_path": data_base_url + "valid.tsv.part*",
  "dataset_reader": {
    "type": "sharded",
    "base_reader": {
      "type": dataset_reader,
      "query_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": model_name,
        "max_length": 500,
      },
      "query_token_indexers": {
        "tokens": {
          "type": "pretrained_transformer",
          "model_name": model_name,
          "namespace": "tokens"
        }
      },
    }
  },
  "model": {
    "type": model,
    "transformer_model": model_name,
  },
  "data_loader": {
    "batch_size": batch_size,
    "shuffle": true
  },
  "distributed": {
    "cuda_devices": if num_gpus > 1 then std.range(0, num_gpus - 1) else 0,
  },
  "trainer": {
    "num_epochs": 10,
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": lr,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "correct_bias": true
    },
    "learning_rate_scheduler": {
      "type": "polynomial_decay",
    },
    "use_amp": true,
    "grad_norm": 1.0,
    "validation_metric": "+rec1",
    "epoch_callbacks": [
      {
        "type": "optuna_pruner"
      }
    ]
  }
}

This is the config I was using. You would have to change the model and dataset reader; I can try to reproduce it with a simpler example using predefined models, etc., but that will take me a while since I won't have access to the multi-GPU cluster for some time.
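
For reference, the two std.extVar values read by the config above ("batch_size" and "lr") are the hyperparameters Optuna fills in per trial. They correspond roughly to suggest calls like the following sketch; the ranges are made-up placeholders, and this is plain Optuna rather than the allennlp-optuna entry point.

import optuna

def define_search_space(trial: optuna.Trial) -> None:
    # The names must match the std.extVar(...) keys read by the Jsonnet config.
    trial.suggest_int("batch_size", 8, 64)           # hypothetical range
    trial.suggest_float("lr", 1e-6, 1e-4, log=True)  # hypothetical range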

@himkt himkt mentioned this issue Nov 14, 2020
himkt (Owner, Author) commented Nov 14, 2020

@vikigenius Thank you for your help.

Let me ask a question: does this configuration work correctly when run on a single GPU (i.e., with distributed disabled)?
The current implementation of the AllenNLP integration's pruning feature may not work in a distributed setting.

If your configuration works on a single GPU, I'll investigate the AllenNLP integration in Optuna. It may take time, though, because the mechanism supporting PruningCallback in the integration is relatively complicated (I implemented it...) and I don't have access to a multi-GPU cluster right now.

Sorry for the inconvenience. 🙇
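
For context, what the optuna_pruner callback is meant to do each epoch follows Optuna's generic pruning protocol, sketched below (this is not the actual AllenNLP integration code). In a distributed run each worker trains in its own process, which is presumably why the callback cannot reach the live trial object there.

import optuna

def on_epoch_end(trial: optuna.Trial, epoch: int, validation_metric: float) -> None:
    # Report the monitored metric (e.g. the "+rec1" value above) for this epoch,
    # then stop the trial early if the pruner judges it unpromising.
    trial.report(validation_metric, step=epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()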

@himkt himkt self-assigned this Nov 14, 2020
@himkt himkt added the bug Something isn't working label Nov 14, 2020
himkt (Owner, Author) commented Nov 14, 2020

Related to optuna/optuna#1990.

himkt (Owner, Author) commented Jul 23, 2021

FYI @vikigenius

I'm working on entirely refactoring the AllenNLP integration in Optuna (optuna/optuna#2796).
After that PR is merged, PruningCallback should work with distributed training.

himkt (Owner, Author) commented Dec 9, 2021

In Optuna v3.0.0a0, we finally introduced support for the pruning callback in distributed training.
https://github.com/optuna/optuna/releases/tag/v3.0.0-a0

pip install -U optuna==3.0.0a0
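
A quick, illustrative way to confirm that the pre-release is the version actually picked up by your environment:

import optuna

print(optuna.__version__)  # expected: 3.0.0a0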

@himkt himkt closed this as completed Dec 9, 2021