MongoObserver possible race condition for multiple runs #452

Closed
F-Barto opened this issue Apr 8, 2019 · 5 comments

Comments

@F-Barto

F-Barto commented Apr 8, 2019

Hi and thanks for your awesome work,

It seems MongoObserver has some race conditions when logging metrics.

Context:

  • We use a scheduler to run our experiments on a cluster
  • Each experiment runs in its own Docker container
  • Multiple experiments may be running at the same time (each in its own container) and accessing the same DB

The issue:
We ran ~100 experiments (same code, just different hyperparameters for the architecture). When looking at the results in Omniboard, some experiments seem to have overlapping metrics plots:
[screenshot: exp_overlap]

while some are ok:
[screenshot: exp_nooverlap]

When digging a bit with pymongo, it appears that the runs with weird plots have an empty run['info'] dict:

{}

while the ones with 'ok' plots have their info dicts populated:

{'metrics': [{'id': '5ca213657ed6d385b4b7c0aa', 'name': 'epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0ac', 'name': 'top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0ae', 'name': 'top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b0', 'name': 'val_epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0b2', 'name': 'val_top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0b4', 'name': 'val_top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b6', 'name': 'lr'},
  {'id': '5ca216687ed6d385b4b7e1dd', 'name': 'test_top1_acc'},
  {'id': '5ca216687ed6d385b4b7e1df', 'name': 'test_top5_acc'}]}
  • The code is the same between all experiments
  • The issue appears to be totally random, so really hard to reproduce
  • The run['info'] dict is never modified in my code
  • I don't know if the run['info'] dict is erased at the end of training or if it is simply never created at the beginning of the experiment
  • The issue only appears when we run a lot of experiments in parallel
  • The weird thing is that all the experiments have different ids, so no 'overlap' strictly speaking
  • So the overlapping display may come from Omniboard, which seems to pick random metrics to display (because how the hell does it find metrics when their ids are not present in run['info']?!)
  • Still, the erasing of the metrics is an issue
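
For reference, this is roughly how we inspected the documents with pymongo. The connection URL and database name ("sacred") below are assumptions; adjust them to your MongoObserver configuration (the runs collection is "runs" by default):

```python
# Minimal sketch of the pymongo check described above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed URL
runs = client["sacred"]["runs"]                    # assumed db name

for run in runs.find({}, {"_id": 1, "info": 1}):
    info = run.get("info") or {}
    if not info:
        print(f"run {run['_id']}: empty info dict")
    else:
        names = [m["name"] for m in info.get("metrics", [])]
        print(f"run {run['_id']}: metrics = {names}")
```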

possible related issues:
#309
#345
#317

@flukeskywalker

I have faced this exact issue, but in my case it really was an ID overlap issue. To confirm you can try the hack I usually use, demonstrated here: #441 (comment)

Basically, I insert a random delay between the creation of Sacred experiments to make sure that there are no overlapping experiment IDs. I have not faced the issue in question since I started using this trick.

It is possible that the delay range needs to be bigger if you are launching a lot more experiments. For a few hundred experiments, a delay range of 0 to 60 seconds works fine for me.
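
A minimal sketch of that hack, assuming the delay is added in the launch script right before the experiment object is created; the experiment name, URL, database name, and 0-60 s range are placeholders:

```python
import random
import time

from sacred import Experiment
from sacred.observers import MongoObserver

# Stagger start-up so concurrent containers are unlikely to create their
# runs at exactly the same moment.
time.sleep(random.uniform(0, 60))

ex = Experiment("my_experiment")  # hypothetical experiment name
ex.observers.append(
    MongoObserver.create(url="mongodb://localhost:27017", db_name="sacred")
)

@ex.automain
def main():
    pass
```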

@F-Barto

F-Barto commented Apr 10, 2019

Hi, thanks for the suggestion.

I think the overlapping display comes from Omniboard, which appears to display random plots when run['info'] = {} (because how the hell does Omniboard find metrics to display when the ids of the metrics are not present?!). The real issue is the erasing of the metrics, so I guess it comes from the update of the DB, or something like that, not the run id itself.

@vnmabus

vnmabus commented Apr 10, 2019

> Basically, I insert a random delay between the creation of Sacred experiments to make sure that there are no overlapping experiment IDs. I have not faced the issue in question since I started using this trick.

For me, Mongo ID collisions disappeared after the merge of #254, without resorting to such hacks. Maybe you could try disabling them and reporting a new bug if you ever see ID collisions again.

@flukeskywalker

@F-Barto: I doubt that that's the issue. The metrics don't have much to do with the info. See here: https://sacred.readthedocs.io/en/latest/collected_information.html#metrics-records

You can directly inspect the metrics for the problematic runs using pymongo. If you still see the same issue, the problem is not coming from Omniboard, and it is likely an ID issue (make sure that you're using the latest Sacred). If your problem is indeed coming from Omniboard, it is probably best to open an issue in the corresponding repo.
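
A rough sketch of that check, assuming MongoObserver's default collection names ("runs" and "metrics") and that metric documents reference the run via a run_id field; URL and db name are assumptions:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sacred"]  # assumed URL/db name

run_id = 123  # hypothetical _id of one of the problematic runs
for metric in db["metrics"].find({"run_id": run_id}):
    print(metric["name"], "->", len(metric.get("values", [])), "values")
```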

@vnmabus: Yes that's been on my TODO list, and I'll check soon.

@F-Barto

F-Barto commented May 7, 2019

Okay, finally found it.

The person having the issue was using pymongo, not Omniboard, to delete runs, but did not delete the corresponding documents in the metrics collection. Hence the id overlap on the metrics at some point.
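
For anyone cleaning up runs with pymongo, a minimal sketch of deleting a run together with its metric documents, so no orphaned metrics are left behind (URL, database, and collection names are assumptions following MongoObserver's defaults):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sacred"]  # assumed URL/db name

def delete_run_with_metrics(run_id):
    """Delete a run and every metric document that references it."""
    # Remove the metric entries first so no orphans are left behind...
    db["metrics"].delete_many({"run_id": run_id})
    # ...then remove the run document itself.
    db["runs"].delete_one({"_id": run_id})

delete_run_with_metrics(123)  # hypothetical run id
```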

Still, the fact that run['info'] is empty when the metrics documents already exist is weird.

Thx for your help all

@F-Barto F-Barto closed this as completed May 7, 2019