MongoObserver possible race condition for multiple runs #452

Closed
F-Barto opened this issue Apr 8, 2019 · 5 comments

Comments

@F-Barto

F-Barto commented Apr 8, 2019

Hi and thanks for your awesome work,

It seems MongoObserver has some race conditions when logging metrics.

Context:

  • We use a scheduler to run our experiments on a cluster
  • Each experiment runs in its own Docker container
  • Multiple experiments may be running at the same time (each in its own container) and accessing the same DB

The issue:
We ran ~100 experiments (same code, just different hyperparameters for the architecture). When looking at the results in Omniboard, some experiments seem to have overlapping metrics plots:
[screenshot: exp_overlap]

while some are ok:
[screenshot: exp_nooverlap]

When digging a bit with pymongo, it appears that the runs with weird plots have an empty run['info'] dict:

{}

while the ones with 'ok' plots have their info dicts populated:

{'metrics': [{'id': '5ca213657ed6d385b4b7c0aa', 'name': 'epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0ac', 'name': 'top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0ae', 'name': 'top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b0', 'name': 'val_epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0b2', 'name': 'val_top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0b4', 'name': 'val_top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b6', 'name': 'lr'},
  {'id': '5ca216687ed6d385b4b7e1dd', 'name': 'test_top1_acc'},
  {'id': '5ca216687ed6d385b4b7e1df', 'name': 'test_top5_acc'}]}
  • The code is the same between all experiments
  • The issue appears to be totally random, so really hard to reproduce
  • The run['info'] dict is never modified in my code
  • I don't know if the run['info'] dict is erased at the end of training or if it is simply never created at the beginning of the experiment
  • The issue only appears when we run a lot of experiments in parallel
  • The weird thing is that all the experiments have different ids, so no 'overlap' strictly speaking
  • So the overlapping display may come from Omniboard, which seems to pick random metrics to display (because how the hell does it find metrics when their ids are not present in run['info']?!)
  • Still, the erasing of the metrics is an issue
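
For reference, this is roughly how we inspected the documents with pymongo. The connection URL and database name ("sacred") below are assumptions; adjust them to your MongoObserver configuration (the runs collection is "runs" by default):

```python
# Minimal sketch of the pymongo check described above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed URL
runs = client["sacred"]["runs"]                    # assumed db name

for run in runs.find({}, {"_id": 1, "info": 1}):
    info = run.get("info") or {}
    if not info:
        print(f"run {run['_id']}: empty info dict")
    else:
        names = [m["name"] for m in info.get("metrics", [])]
        print(f"run {run['_id']}: metrics = {names}")
```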

possible related issues:
#309
#345
#317

@flukeskywalker

I have faced this exact issue, but in my case it really was an ID overlap issue. To confirm you can try the hack I usually use, demonstrated here: #441 (comment)

Basically, I insert a random delay between the creation of Sacred experiments to make sure that there are no overlapping experiment IDs. I have not faced the issue in question since I started using this trick.

It is possible that the delay range needs to be bigger if you are launching a lot more experiments. For a few hundred experiments, a delay range of 0 to 60 seconds works fine for me.
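
A minimal sketch of that hack, assuming the delay is added in the launch script right before the experiment object is created; the experiment name, URL, database name, and 0-60 s range are placeholders:

```python
import random
import time

from sacred import Experiment
from sacred.observers import MongoObserver

# Stagger start-up so concurrent containers are unlikely to create their
# runs at exactly the same moment.
time.sleep(random.uniform(0, 60))

ex = Experiment("my_experiment")  # hypothetical experiment name
ex.observers.append(
    MongoObserver.create(url="mongodb://localhost:27017", db_name="sacred")
)

@ex.automain
def main():
    pass
```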

@F-Barto

F-Barto commented Apr 10, 2019

Hi, thanks for the suggestion.

I think the overlapping display comes from Omniboard, which appears to display random plots when run['info'] = {} (because how the hell does Omniboard find metrics to display when the ids of the metrics are not present?!). The real issue is the erasing of the metrics, so I guess it comes from the update of the DB, or something like that, not the run id itself.

@vnmabus

vnmabus commented Apr 10, 2019

> Basically, I insert a random delay between the creation of Sacred experiments to make sure that there are no overlapping experiment IDs. I have not faced the issue in question since I started using this trick.

For me, Mongo ID collisions disappeared after the merge of #254, without resorting to such hacks. Maybe you could try disabling them and reporting a new bug if you ever see ID collisions again.

@flukeskywalker

@F-Barto: I doubt that that's the issue. The metrics don't have much to do with the info. See here: https://sacred.readthedocs.io/en/latest/collected_information.html#metrics-records

You can directly inspect the metrics for the problematic runs using pymongo. If you still see the same issue, the problem is not coming from Omniboard, and it is likely an ID issue (make sure that you're using the latest Sacred). If your problem is indeed coming from Omniboard, it is probably best to open an issue in the corresponding repo.
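
A rough sketch of that check, assuming MongoObserver's default collection names ("runs" and "metrics") and that metric documents reference the run via a run_id field; URL and db name are assumptions:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sacred"]  # assumed URL/db name

run_id = 123  # hypothetical _id of one of the problematic runs
for metric in db["metrics"].find({"run_id": run_id}):
    print(metric["name"], "->", len(metric.get("values", [])), "values")
```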

@vnmabus: Yes that's been on my TODO list, and I'll check soon.

@F-Barto

F-Barto commented May 7, 2019

Okay, finally found it.

The person having the issue was using pymongo, not Omniboard, to delete runs, but did not delete the corresponding documents in the metrics collection. Hence the id overlap on the metrics at some point.
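
For anyone cleaning up runs with pymongo, a minimal sketch of deleting a run together with its metric documents, so no orphaned metrics are left behind (URL, database, and collection names are assumptions following MongoObserver's defaults):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sacred"]  # assumed URL/db name

def delete_run_with_metrics(run_id):
    """Delete a run and every metric document that references it."""
    # Remove the metric entries first so no orphans are left behind...
    db["metrics"].delete_many({"run_id": run_id})
    # ...then remove the run document itself.
    db["runs"].delete_one({"_id": run_id})

delete_run_with_metrics(123)  # hypothetical run id
```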

Still, the fact that run['info'] is empty when the metrics documents already exist is weird.

Thx for your help all

@F-Barto F-Barto closed this as completed May 7, 2019