FileNotFoundError: [Errno 2] No such file or directory: [result.json, progress.csv, tfevents]
Sometimes the output of every LoggerCallback in use (result.json, progress.csv, tfevents) is not written for a few trials. The behavior is non-deterministic and appears to be related to concurrency.
For that reason, I tried running trials sequentially (concurrency=False), and the problem went away.
Unfortunately, I had to patch _create_default_callbacks on my end to get trials running sequentially (concurrency=False): there is a bug in _create_default_callbacks that adds ProgressReporter multiple times (tested with 2.35.0 only).
So neither concurrent nor sequential trials run without problems. Is there any workaround to get trials finished and all logs written safely?
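To narrow down which trials are affected, something like the following could be run after an experiment finishes. It is only a rough sketch using the standard library, not part of Ray: the function name `missing_artifacts` is made up for illustration, and it assumes that every subdirectory of the experiment directory is a trial directory and that tfevents files can be recognized by the substring "tfevents" in their name.

```python
from pathlib import Path

# Artifact names taken from the error message above. tfevents files have a
# longer generated name, so all entries are matched by substring (assumption).
EXPECTED = ("result.json", "progress.csv", "tfevents")


def missing_artifacts(experiment_dir):
    """Map each trial directory name to the expected artifacts it lacks.

    Assumes every subdirectory of experiment_dir is a trial directory.
    """
    report = {}
    for trial_dir in sorted(Path(experiment_dir).iterdir()):
        if not trial_dir.is_dir():
            continue
        names = [p.name for p in trial_dir.iterdir() if p.is_file()]
        missing = [e for e in EXPECTED if not any(e in n for n in names)]
        if missing:
            report[trial_dir.name] = missing
    return report
```

Calling `missing_artifacts("/path/to/experiment")` (placeholder path) returns an empty dict when every trial directory contains all three kinds of files, which makes it easy to compare concurrent and sequential runs.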
Issue Severity
Blocker
ray 2.35.0 (AirEntrypoint.TUNE_RUN_EXPERIMENTS)
python 3.9
Red Hat 9.4
karstenddwx changed the title from "driver_artifacts sometimes not written with concurrent trail execution" to "driver_artifacts sometimes not written with concurrent trails execution" on Nov 15, 2024
Any idea how to fix the concurrent trial execution issue?
jcotant1 added the tune, triage, and bug labels on Nov 17, 2024
Originally posted by @karstenddwx in #46607 (comment)