Datasets cache folder not shared between users #671
Comments
@mralgos do I understand correctly that more than one user is trying to access this dataset from the same workstation?
@jkhenning yes, correct. It is possible that multiple users need to read the dataset cached on the same workstation.
@mralgos in that case it would seem to be a Linux permissions issue that might be outside the scope of the ClearML code. How would you expect it to work? As far as I know, the lock is required to make sure multiple writers don't compete for the same file.
I'd expect that if someone is writing the dataset, the lock folder must be created in order to prevent other write operations on the same dataset. However, if the dataset already exists, multiple users should be able to read it (hence the lock wouldn't be necessary). I'm going to check the permissions and the umask settings in the meantime.
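The behaviour expected in the comment above (one writer at a time, any number of concurrent readers) can be sketched with advisory file locks. This is a minimal illustration using Python's standard `fcntl.flock` on Linux, not ClearML's actual locking code; the helper name `cache_lock` and the lock file name are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def cache_lock(lock_path, exclusive, blocking=True):
    """Hold an advisory flock on lock_path: shared for readers, exclusive for writers.

    Hypothetical helper for illustration only; it is not part of ClearML.
    """
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not blocking:
        mode |= fcntl.LOCK_NB
    # Open read-only so that a user who can only read an existing lock file
    # can still take a lock on it; O_CREAT creates the file on first use.
    fd = os.open(lock_path, os.O_RDONLY | os.O_CREAT, 0o664)
    try:
        # Raises BlockingIOError when non-blocking and a conflicting lock is held.
        fcntl.flock(fd, mode)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


if __name__ == "__main__":
    lock_file = os.path.join(tempfile.mkdtemp(), "dataset.lock")
    with cache_lock(lock_file, exclusive=False):
        # A second reader can share the lock.
        with cache_lock(lock_file, exclusive=False, blocking=False):
            print("two readers hold the lock at once")
        # A writer cannot take the lock while a reader holds it.
        try:
            with cache_lock(lock_file, exclusive=True, blocking=False):
                pass
        except BlockingIOError:
            print("writer is kept out while a reader is active")
```

Note that `flock` locks are advisory and their behaviour on network file systems can differ, so this sketch assumes a local Linux file system.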
@jkhenning After a bit of debugging I have found that setting the umask to 002 (i.e. default write permission for the unix group) only partially resolves the problem: while the
What is your expected behaviour when a user starts two experiments that require the same dataset on the same server and the cache is not yet built? This is the case when two processes start downloading the dataset into the same cache location at the same time. What I'm seeing is a bit unpredictable:
As a final note, having the umask set to 002 creates other headaches, because in our multi-user setup it means that any user belonging to the same unix group can write to other users' files.
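For readers unfamiliar with the umask trade-off described above, the following sketch shows how the umask decides whether a newly created cache folder is group-writable. It uses only the Python standard library; the helper name and directory names are hypothetical, and the printed modes assume a file system without default ACLs.

```python
import os
import stat
import tempfile


def make_dir_with_umask(parent, name, umask):
    """Create parent/name under the given umask and return its permission bits."""
    previous = os.umask(umask)  # os.umask returns the previous mask
    try:
        path = os.path.join(parent, name)
        os.mkdir(path)  # requested mode defaults to 0o777, then the umask is applied
    finally:
        os.umask(previous)  # always restore the original mask
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    for mask in (0o022, 0o002):
        mode = make_dir_with_umask(base, f"cache_{mask:03o}", mask)
        writable = bool(mode & stat.S_IWGRP)
        print(f"umask {mask:03o}: mode {mode:03o}, group-writable: {writable}")
```

With umask 022 the group loses write access to everything a user creates, while 002 grants it everywhere, which is the side effect mentioned above.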
@jkhenning I've put together a change that would potentially fix the issue (assuming I'm not missing something in the overall design). Basically, two main changes:
Hope this helps. Let me know what you think.
Hi @mralgos! Can you please open a PR for your fix? I think it looks good.
Hi @mralgos, thanks for the contribution! Closing this issue 🙂
Hello,
a discussion on this issue started on Slack here.
I have a ClearML server hosted on AWS with web authentication enabled. Each ML person has:
The config file defines the path to the cache folder via:
We have some datasets registered by the ClearML server and the codebase uses get_local_copy() to download the data onto the machine. The problem manifests when two or more people want to access (read, i.e. the cache already exists and isn't corrupted) the dataset. The execution fails with this error: