driver_artifacts sometimes not written with concurrent trails execution #48757

Open
karstenddwx opened this issue Nov 15, 2024 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
tune: Tune-related issues

Comments


karstenddwx commented Nov 15, 2024

FileNotFoundError: [Errno 2] No such file or directory: [result.json, progress.csv, tfevents]

Sometimes the output of all LoggerCallbacks in use (result.json, progress.csv, tfevents) is not written for a few trials. The behavior is non-deterministic, which suggests it is related to concurrency.
To check this, I ran the trials sequentially (concurrency=False), and the problem went away.
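
One way to spot affected trials after a run is to scan the experiment directory for trial folders that lack the expected logger files. The following is only a minimal sketch, assuming one subdirectory per trial and the usual TensorBoard naming (`events.out.tfevents.*`) for the tfevents file; `find_trials_missing_logs` is a hypothetical helper, not part of Ray.

```python
from pathlib import Path

# File patterns taken from the error above. The tfevents glob is an
# assumption based on the usual TensorBoard naming convention.
EXPECTED_PATTERNS = ("result.json", "progress.csv", "events.out.tfevents.*")


def find_trials_missing_logs(experiment_dir):
    """Map each trial directory name to the expected patterns it lacks.

    Hypothetical helper: assumes every subdirectory of `experiment_dir`
    is a trial directory. Plain files at the top level are ignored.
    """
    missing = {}
    for trial_dir in sorted(Path(experiment_dir).iterdir()):
        if not trial_dir.is_dir():
            continue
        # A pattern counts as present if at least one file matches it.
        absent = [p for p in EXPECTED_PATTERNS if not any(trial_dir.glob(p))]
        if absent:
            missing[trial_dir.name] = absent
    return missing
```

Calling `find_trials_missing_logs("<path to the experiment directory>")` returns an empty dict when every trial directory contains all three outputs, so a non-empty result identifies the trials hit by the problem.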

Unfortunately, I had to apply a fix on my end in _create_default_callbacks to get trials running sequentially (concurrency=False), because a bug in _create_default_callbacks adds ProgressReporter multiple times (tested with 2.35.0 only).

So neither concurrent nor sequential trials run without problems. Is there a workaround that lets trials finish with all logs written safely?

Issue Severity

Blocker

ray 2.35.0 (AirEntrypoint.TUNE_RUN_EXPERIMENTS)
python 3.9
Red Hat 9.4

Originally posted by @karstenddwx in #46607 (comment)

@karstenddwx karstenddwx changed the title driver_artifacts sometimes not written with concurrent trail execution driver_artifacts sometimes not written with concurrent trails execution Nov 15, 2024
@karstenddwx
Contributor Author

Any idea on how to fix the concurrent trial execution issue?

@jcotant1 jcotant1 added tune Tune-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component) bug Something that is supposed to be working; but isn't labels Nov 17, 2024