[tune] tune.ray() gives repeated status without any further execution #17359

Closed
Rashmikoparde opened this issue Jul 27, 2021 · 9 comments
Labels
enhancement Request for new feature and/or capability


@Rashmikoparde

I am using tune.run() for hyperparameter tuning in PyCharm. When I run the script, it prints the trial status over and over and nothing else executes.

PyCharm: 2020.2.3
Python: 3.6.8
torch: 1.9.0
ray: 1.5.0

Any insights will be helpful.

@Rashmikoparde added the "enhancement" (Request for new feature and/or capability) label Jul 27, 2021
@Rashmikoparde changed the title from "[tune] tune.ray() stuks" to "[tune] tune.ray() gives repeated status without any further execution" Jul 27, 2021
@krfricke
Contributor

What most likely happens here is that you request resources for your trials that cannot be fulfilled by your cluster. Can you provide your training code (or at least the call to tune.run())?

The problem with Tune here is that we're currently not throwing any warning when resource requests cannot be fulfilled. Ray Tune waits forever for resources that will never be available. This is the same underlying issue as here: #16425 - we're working on resolving this. However, this will only fix the warning - to make your code run you will have to either adjust your resource requests or add more resources to the cluster.
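A rough way to spot this before launching is to compare the per-trial request with what Ray reports for the cluster. The sketch below is only an estimate under simple assumptions (a single pool of resources, no per-node placement); `max_concurrent_trials` is a hypothetical helper, not part of Ray Tune, and the example cluster dict is made up.

```python
import math


def max_concurrent_trials(resources_per_trial, cluster_resources):
    """Estimate how many trials fit at once (hypothetical helper).

    A result of 0 means no trial can ever start, which matches the
    "status repeats forever" symptom described in this issue.
    """
    # resources_per_trial uses lowercase keys ("cpu", "gpu"); Ray reports
    # cluster totals with uppercase keys ("CPU", "GPU"), so normalize both.
    available = {name.upper(): amount for name, amount in cluster_resources.items()}
    limits = []
    for name, requested in resources_per_trial.items():
        if requested <= 0:
            continue  # a zero request never blocks scheduling
        have = available.get(name.upper(), 0)
        # Small epsilon guards against float rounding in the division.
        limits.append(math.floor(have / requested + 1e-9))
    return min(limits) if limits else math.inf


# Made-up example: an 8-CPU machine on which no GPU was detected.
request = {"gpu": 0.15, "cpu": 2}
print(max_concurrent_trials(request, {"CPU": 8.0}))
```

With Ray running, the second argument could come from `ray.cluster_resources()`, which returns a dict of resource totals; a `GPU` entry is typically only present when Ray detected one.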

@krfricke
Contributor

Also cc @xwjiang2010 who is working on fixing the warning message

@Rashmikoparde
Author

Rashmikoparde commented Jul 27, 2021

Here is my code

ray.init()

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="valid_acc",
    mode='max',
    perturbation_interval=3,
    custom_explore_fn=explore,
    log_config=True)

tune.run(
    RayModel,
    name=hparams['ray_name'],
    scheduler=pbt,
    reuse_actors=True,
    verbose=True,
    checkpoint_score_attr="valid_acc",
    checkpoint_freq=FLAGS.checkpoint_freq,
    resources_per_trial={"gpu": 0.15, "cpu": 2},
    stop={"training_iteration": hparams['num_epochs']},
    config=hparams,
    local_dir=FLAGS.ray_dir,
    num_samples=16
)

@xwjiang2010

@Rashmikoparde
Author

Rashmikoparde commented Jul 27, 2021

I specified GPU as 0. I get the following error.

Failure # 1 (occurred at 2021-07-27_11-22-36)
Traceback (most recent call last):
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\ray_trial_executor.py", line 729, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\worker.py", line 1564, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NotImplementedError): ray::RayModel.train_buffered() (pid=14856, ip=192.168.178.64)
  File "python\ray\_raylet.pyx", line 534, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\_private\function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 178, in train_buffered
    result = self.train()
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 237, in train
    result = self.step()
  File "C:\Users\rask\Downloads\T\modals-main\venv\lib\site-packages\ray\tune\trainable.py", line 659, in step
    raise NotImplementedError
NotImplementedError

@krfricke @xwjiang2010

@krfricke
Contributor

So the initial problem was fixed by setting GPUs to 0 - your machine doesn't seem to have a GPU, or it was not detected by the system (e.g. CUDA).
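One way to avoid hard-coding a GPU request that the machine cannot satisfy is to derive it from what is actually detected. This is a minimal sketch, assuming the GPU count comes from something like `torch.cuda.device_count()`; `pick_resources` and its defaults are hypothetical and simply mirror the values from the `tune.run()` call posted earlier in this thread.

```python
def pick_resources(num_gpus_detected, cpus_per_trial=2, gpu_fraction=0.15):
    """Build a resources_per_trial dict (hypothetical helper).

    Requests a GPU share only when at least one GPU was detected, so the
    request can still be scheduled on a CPU-only machine.
    """
    gpu = gpu_fraction if num_gpus_detected > 0 else 0
    return {"cpu": cpus_per_trial, "gpu": gpu}


# CPU-only machine: no GPU share is requested.
print(pick_resources(0))
# One detected GPU: each trial asks for a fractional share of it.
print(pick_resources(1))
```

The result could then be passed as `resources_per_trial=pick_resources(torch.cuda.device_count())`. Note that what torch detects and what Ray detects may differ, so this is a convenience rather than a guarantee.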

The error you're currently seeing stems from a trainable class that does not implement all abstract methods. Your trainable class should implement at least a setup and a step method. See the class API reference here: https://docs.ray.io/en/master/tune/api_docs/trainable.html#trainable-class-api
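For illustration, a minimal class-API trainable might look like the sketch below. The metric name `valid_acc` is borrowed from this thread; the `lr` config key and the toy "training" logic are assumptions made up for the example. The import fallback is only there so the sketch can be read and imported without Ray installed.

```python
try:
    from ray import tune
    TrainableBase = tune.Trainable
except ImportError:
    # Fallback for illustration only; a real trainable needs Ray installed.
    TrainableBase = object


class MinimalTrainable(TrainableBase):
    def setup(self, config):
        # Called once when the trial starts; config holds the hyperparameters.
        self.lr = config.get("lr", 0.1)  # "lr" is a made-up example key
        self.score = 0.0

    def step(self):
        # Called once per training iteration; must return a dict of metrics.
        self.score += self.lr
        return {"valid_acc": self.score}
```

Without a `step` method, the base class implementation is the one that runs, and it raises `NotImplementedError` as in the traceback above.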

You can share your RayModel class if you'd like.

@Rashmikoparde
Author

Rashmikoparde commented Jul 27, 2021

This is my RayModel class, @krfricke.
The code is taken from https://github.com/jamestszhim/modals

`class RayModel(tune.Trainable):
    def _setup(self, *args):
        self.trainer = TextModelTrainer(self.config)

    def _train(self):
        print(f'Starting Ray Iteration: {self._iteration}')
        train_acc, valid_acc = self.trainer.run_model(self._iteration)
        test_acc, test_loss = self.trainer._test(self._iteration, mode='test')
        return {'train_acc': train_acc, 'valid_acc': valid_acc, 'test_acc': test_acc}

    def _save(self, checkpoint_dir):
        print(checkpoint_dir)
        path = self.trainer.save_model(checkpoint_dir, self._iteration)
        print(path)
        return path

    def _restore(self, checkpoint_path):
        self.trainer.load_model(checkpoint_path)

    def reset_config(self, new_config):
        self.config = new_config
        self.trainer.reset_config(self.config)
        return True`

@krfricke
Contributor

It seems your RayModel class is based on an old and deprecated API.

You can fix it like this:

  • Rename _setup to setup
  • Rename _train to step
  • Rename _save to save_checkpoint
  • Rename _restore to load_checkpoint
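The renames above can also be checked mechanically. The following is a small sketch using only the standard library; `deprecated_hooks` and `OldStyleModel` are hypothetical names, and the stand-in class does not inherit from `tune.Trainable` so that the example runs without Ray.

```python
# Old (deprecated) Trainable hooks mapped to their current names,
# taken from the rename list in this thread.
RENAMES = {
    "_setup": "setup",
    "_train": "step",
    "_save": "save_checkpoint",
    "_restore": "load_checkpoint",
}


def deprecated_hooks(cls):
    """Return {old_name: new_name} for deprecated hooks defined directly on cls."""
    return {old: new for old, new in RENAMES.items() if old in vars(cls)}


class OldStyleModel:  # stand-in for a tune.Trainable subclass
    def _setup(self, *args):
        pass

    def _train(self):
        return {}

    def reset_config(self, new_config):
        return True


for old, new in deprecated_hooks(OldStyleModel).items():
    print(f"rename {old} -> {new}")
```

Methods such as `reset_config` kept their name, so they are not reported.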

@Rashmikoparde
Author

It's working now. Thanks.
Is this because of the higher version of Ray?

@krfricke
Contributor

krfricke commented Jul 27, 2021

Yes - we deprecated the old Trainable API about a year ago (on July 1st 2020, here d35f0e4#diff-e1d889098f6b27e0d88ba206b0689d77c1a320d58697d98933decde97fd3cac8) and have raised a deprecation warning since then.

Glad we could resolve your problem - I'll close this issue, but feel free to add to it or reopen if any questions remain.
