[Release][Tune] tune_cloud_aws_durable_upload failed after restoring experiment #32842

Closed
justinvyu opened this issue Feb 25, 2023 · 1 comment
Labels: release-test (release test), tune (Tune-related issues)

Comments

@justinvyu
Contributor

PR #32334 recently modified this release test. However, the failure doesn't seem to be related to that PR at all -- the assertions it added are passing, as shown in the logs. This may be a one-off test failure, but I'm tracking it here in case a regression has been introduced.

The test gets through the initial run of the experiment, and the between_experiments assertions pass. After restoring the experiment, we get the following error:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 542, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 637, in _on_pg_ready
    assert next_trial is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "workloads/_tune_script.py", line 159, in <module>
    run_tune(**run_kwargs)
  File "workloads/_tune_script.py", line 119, in run_tune
    **kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 786, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 602, in step
    self._wait_and_handle_event(next_trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 581, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 542, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 637, in _on_pg_ready
    assert next_trial is not None
AssertionError

Process started, trials are running
Waiting 90 seconds until sending signal 10 to process 2645
(raylet) *** SIGTERM received at time=1677194273 on cpu 1 ***
(raylet) PC: @     0x7f5e6626d46e  (unknown)  epoll_wait
(raylet)     @     0x7f5e664ae420  (unknown)  (unknown)
(raylet)     @          0x1b7ffa0  (unknown)  (unknown)
(raylet) [2023-02-23 15:17:53,910 E 2762 2762] logging.cc:361: *** SIGTERM received at time=1677194273 on cpu 1 ***
(raylet) [2023-02-23 15:17:53,910 E 2762 2762] logging.cc:361: PC: @     0x7f5e6626d46e  (unknown)  epoll_wait
(raylet) [2023-02-23 15:17:53,911 E 2762 2762] logging.cc:361:     @     0x7f5e664ae420  (unknown)  (unknown)
(raylet) [2023-02-23 15:17:53,913 E 2762 2762] logging.cc:361:     @          0x1b7ffa0  (unknown)  (unknown)
(raylet) [2023-02-23 15:17:53,915 E 2762 2818] core_worker.cc:3397: Mismatched ActorID: ignoring KillActor for previous actor 68bc17314e9543a2f671d63805000000, current actor ID: NIL_ID
Traceback (most recent call last):
  File "workloads/run_cloud_test.py", line 1459, in <module>
    raise err
  File "workloads/run_cloud_test.py", line 1411, in <module>
    args.variant, args.trainable, run_time, bucket, args.cpus_per_trial
  File "workloads/run_cloud_test.py", line 1393, in _run_test
    test_durable_upload(bucket)
  File "workloads/run_cloud_test.py", line 1329, in test_durable_upload
    after_experiments_callback=after_experiments,
  File "workloads/run_cloud_test.py", line 400, in run_resume_flow
    upload_dir=upload_dir,
  File "workloads/run_cloud_test.py", line 328, in run_tune_script_for_time
    send_signal_after_wait(process, signal=signal.SIGUSR1, wait=run_time)
  File "workloads/run_cloud_test.py", line 279, in send_signal_after_wait
    f"Process {process.pid} already terminated. This usually means "
RuntimeError: Process 2645 already terminated. This usually means that some of the trials ERRORed (e.g. because they couldn't be restored. Try re-running this test to see if this fixes the issue.
Subprocess return code: 1
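
For context, the final RuntimeError comes from the harness checking that the tune script is still alive before signaling it. Below is a minimal sketch of that kind of check, assuming a `subprocess.Popen` handle on a POSIX system. It is a reconstruction from the traceback and the log lines above, not the actual code in `run_cloud_test.py`: the function body is hypothetical, and the `signal_num` parameter name is my own (the real helper takes `signal=`).

```python
import signal
import subprocess
import sys
import time


def send_signal_after_wait(process, signal_num=signal.SIGUSR1, wait=90):
    """Wait `wait` seconds, then signal the process if it is still alive.

    Hypothetical reconstruction; the real helper may differ.
    """
    print(
        f"Waiting {wait} seconds until sending signal "
        f"{int(signal_num)} to process {process.pid}"
    )
    time.sleep(wait)
    # poll() returns None while the process is running, else its exit code.
    if process.poll() is not None:
        raise RuntimeError(
            f"Process {process.pid} already terminated "
            f"(return code {process.returncode})."
        )
    process.send_signal(signal_num)


if __name__ == "__main__":
    # A child that exits immediately stands in for a tune script that errored.
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"])
    child.wait()
    try:
        send_signal_after_wait(child, wait=0)
    except RuntimeError as exc:
        print(exc)
```

In other words, the RuntimeError is a symptom: the tune script had already exited with the TuneError shown earlier, so there was nothing left to signal.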

See full logs here: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_lb6qiakb5g3s7n6yz38sxl28yt?pjd-section=last-log
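
A note on reading the first traceback: the AssertionError shows up twice because `_wait_and_handle_event` re-raises it as a `TuneError` whose message is the formatted original traceback (`raise TuneError(traceback.format_exc())`). The sketch below reproduces that pattern in isolation. `TuneError` here is a local stand-in so it runs without Ray, the function names only mirror the ones in the traceback, and the `except Exception` clause is an assumption about how the real handler is written.

```python
import traceback


class TuneError(Exception):
    """Stand-in for ray.tune.error.TuneError, so the sketch runs without Ray."""


def on_pg_ready(next_trial):
    # Mirrors the failing assertion shown in the traceback.
    assert next_trial is not None


def wait_and_handle_event(next_trial):
    try:
        on_pg_ready(next_trial)
    except Exception:
        # Raising inside `except` chains the new error to the original one,
        # and format_exc() copies the original traceback text into the message.
        raise TuneError(traceback.format_exc())


if __name__ == "__main__":
    try:
        wait_and_handle_event(None)
    except TuneError as err:
        print(type(err.__context__).__name__)  # AssertionError
        print("AssertionError" in str(err))    # True
```

So there is a single underlying failure here: `next_trial` was `None` when the placement-group-ready event was handled after the restore.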

@justinvyu added the tune and release-test labels on Feb 25, 2023
@justinvyu self-assigned this on Feb 25, 2023
@justinvyu
Contributor Author

Closing for now. Hasn't failed since.
