[tune] tune.ray() gives repeated status without any further execution #17359

Rashmikoparde · 2021-07-27T07:26:17Z

I am using tune.run() for hyperparameter tuning in PyCharm. When I run the script, it prints the trial status repeatedly and nothing further gets executed.

PyCharm: 2020.2.3
Python: 3.6.8
torch: 1.9.0
ray: 1.5.0

Any insights will be helpful.

krfricke · 2021-07-27T09:00:41Z

What most likely happens here is that you request resources for your trials that cannot be fulfilled by your cluster. Can you provide your training code (or at least the call to tune.run())?

The problem with Tune here is that we're currently not throwing any warning when resource requests cannot be fulfilled. Ray Tune waits forever for resources that will never be available. This is the same underlying issue as here: #16425 - we're working on resolving this. However, this will only fix the warning - to make your code run you will have to either adjust your resource requests or add more resources to the cluster.
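A quick way to see whether a request can ever be scheduled is to compare it against the cluster's total resources. The following is a minimal sketch, not part of Tune: the helper name `unsatisfiable_resources` and the example cluster dict are assumptions made for illustration. On a live cluster, `ray.cluster_resources()` returns a dict of totals that can be passed in instead of the hard-coded one.

```python
from typing import Dict


def unsatisfiable_resources(request: Dict[str, float],
                            cluster: Dict[str, float]) -> Dict[str, float]:
    """Return the requested resources that exceed the cluster's totals.

    Keys are compared case-insensitively, because Tune requests use
    lowercase names ("cpu", "gpu") while Ray reports "CPU" and "GPU".
    """
    totals = {name.lower(): amount for name, amount in cluster.items()}
    return {
        name: amount
        for name, amount in request.items()
        if amount > totals.get(name.lower(), 0.0)
    }


# Per-trial request taken from the tune.run() call in this thread.
request = {"gpu": 0.15, "cpu": 2}

# Hypothetical CPU-only machine (assumption): no "GPU" entry at all.
cluster = {"CPU": 8.0}

print(unsatisfiable_resources(request, cluster))  # {'gpu': 0.15}
```

An empty result means a single trial fits. A non-empty result names the resources for which a trial would wait indefinitely.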

krfricke · 2021-07-27T09:02:43Z

Also cc @xwjiang2010 who is working on fixing the warning message

Rashmikoparde · 2021-07-27T09:04:07Z

Here is my code

ray.init()

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="valid_acc",
    mode='max',
    perturbation_interval=3,
    custom_explore_fn=explore,
    log_config=True)

tune.run(
    RayModel,
    name=hparams['ray_name'],
    scheduler=pbt,
    reuse_actors=True,
    verbose=True,
    checkpoint_score_attr="valid_acc",
    checkpoint_freq=FLAGS.checkpoint_freq,
    resources_per_trial={"gpu": 0.15, "cpu": 2},
    stop={"training_iteration": hparams['num_epochs']},
    config=hparams,
    local_dir=FLAGS.ray_dir,
    num_samples=16
)

@xwjiang2010

Rashmikoparde · 2021-07-27T09:24:53Z

I set the GPU request to 0. Now I get the following error.

Failure # 1 (occurred at 2021-07-27_11-22-36)
Traceback (most recent call last):
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\ray_trial_executor.py", line 729, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\worker.py", line 1564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NotImplementedError): ray::RayModel.train_buffered() (pid=14856, ip=192.168.178.64)
  File "python\ray\_raylet.pyx", line 534, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\_private\function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 178, in train_buffered
    result = self.train()
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 237, in train
    result = self.step()
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 659, in step
    raise NotImplementedError
NotImplementedError

@krfricke @xwjiang2010

krfricke · 2021-07-27T09:40:01Z

So the initial problem was fixed by setting GPUs to 0 - your machine doesn't seem to have a GPU, or it was not detected by the system (e.g. CUDA).

The error you're currently seeing stems from a trainable class that does not implement all abstract methods. Your trainable class should implement at least a setup and a step method. See the class API reference here: https://docs.ray.io/en/master/tune/api_docs/trainable.html#trainable-class-api
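The mechanism behind the traceback can be reproduced without Ray. The sketch below uses a simplified stand-in for the base class, not Ray's actual code: `BaseTrainable`, `IncompleteModel`, `CompleteModel`, and `try_train` are hypothetical names. It mirrors only what the traceback shows, namely that `train()` calls `self.step()` and that the base `step()` raises `NotImplementedError`.

```python
class BaseTrainable:
    """Simplified stand-in for tune.Trainable (assumption, not Ray's code)."""

    def train(self):
        # The framework drives training through the step() hook.
        return self.step()

    def step(self):
        # Subclasses are expected to override this.
        raise NotImplementedError


class IncompleteModel(BaseTrainable):
    """Defines setup() but no step(), so train() hits the base class."""

    def setup(self, config):
        self.config = config


class CompleteModel(BaseTrainable):
    """Overrides step(), so train() returns a result dict."""

    def step(self):
        return {"valid_acc": 0.9}


def try_train(model):
    """Run one training call and report the exception name on failure."""
    try:
        return model.train()
    except NotImplementedError:
        return "NotImplementedError"


print(try_train(IncompleteModel()))  # NotImplementedError
print(try_train(CompleteModel()))    # {'valid_acc': 0.9}
```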

You can share your RayModel class if you'd like.

Rashmikoparde · 2021-07-27T09:48:30Z

This is my RayModel Class
@krfricke
The code is taken from https://github.com/jamestszhim/modals

```python
class RayModel(tune.Trainable):
    def _setup(self, *args):
        self.trainer = TextModelTrainer(self.config)

    def _train(self):
        print(f'Starting Ray Iteration: {self._iteration}')
        train_acc, valid_acc = self.trainer.run_model(self._iteration)
        test_acc, test_loss = self.trainer._test(self._iteration, mode='test')
        return {'train_acc': train_acc, 'valid_acc': valid_acc, 'test_acc': test_acc}

    def _save(self, checkpoint_dir):
        print(checkpoint_dir)
        path = self.trainer.save_model(checkpoint_dir, self._iteration)
        print(path)
        return path

    def _restore(self, checkpoint_path):
        self.trainer.load_model(checkpoint_path)

    def reset_config(self, new_config):
        self.config = new_config
        self.trainer.reset_config(self.config)
        return True
```

krfricke · 2021-07-27T09:53:11Z

It seems your RayModel class is based on an old and deprecated API.

You can fix it like this:

Rename _setup to setup
Rename _train to step
Rename _save to save_checkpoint
Rename _restore to load_checkpoint
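The rename list above can also be used as a quick check on other trainables copied from older projects. This is a rough sketch: the mapping comes from the four renames listed in this reply, while the helper `deprecated_hooks` and the bare placeholder class are assumptions for illustration (the placeholder does not subclass `tune.Trainable`, so the snippet runs without Ray installed).

```python
# Old hook name -> new hook name, as listed in the reply above.
RENAMES = {
    "_setup": "setup",
    "_train": "step",
    "_save": "save_checkpoint",
    "_restore": "load_checkpoint",
}


def deprecated_hooks(cls):
    """Return {old_name: new_name} for deprecated hooks defined on cls itself.

    Only the class's own attributes are inspected (via vars()), so hooks
    inherited from a parent class are not reported.
    """
    return {old: new for old, new in RENAMES.items() if old in vars(cls)}


class RayModel:  # placeholder with the same method names as in the report
    def _setup(self, *args): ...
    def _train(self): ...
    def _save(self, checkpoint_dir): ...
    def _restore(self, checkpoint_path): ...
    def reset_config(self, new_config): ...


for old, new in deprecated_hooks(RayModel).items():
    print(f"rename {old} -> {new}")
```

`reset_config` is not reported because its name is unchanged in the list above.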

Rashmikoparde · 2021-07-27T10:07:35Z

It's working now. Thanks.
Is this because of the higher version of Ray?

krfricke · 2021-07-27T10:20:42Z

Yes - we deprecated the old Trainable method names about a year ago (on July 1st 2020, here d35f0e4#diff-e1d889098f6b27e0d88ba206b0689d77c1a320d58697d98933decde97fd3cac8) and have been throwing a deprecation warning since then.

Glad we could resolve your problem - I'll close this issue, but feel free to add to it or reopen if any questions remain.

Rashmikoparde added the enhancement Request for new feature and/or capability label Jul 27, 2021

Rashmikoparde changed the title ~~[tune] tune.ray() stuks~~ [tune] tune.ray() gives repeated status without any further execution Jul 27, 2021

krfricke assigned krfricke and xwjiang2010 Jul 27, 2021

krfricke closed this as completed Jul 27, 2021
