Workers in the distributed scenario need to see different instances #4241
Conversation
Flake still sucks though. It doesn't like assigning lambdas.
There are no tests and I'm not done testing it manually yet either. That said, @matt-gardner, thoughts?
The basic approach seems fine to me, other than the use of lambda functions.
```python
read_fn = self._read
if self.max_instances is not None:
    # Double lambda ensures that read_fn doesn't call itself recursively.
    read_fn = lambda f: (
```
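The "double lambda" guards against Python's late-binding closures. A minimal standalone sketch of the pitfall, using a toy read function rather than the actual reader:

```python
# Toy illustration: closures capture the *variable*, not its value, so
# rebinding a name to a lambda that closes over that same name recurses.
read = lambda: ["a", "b", "c"]

# Naive rewrap would recurse forever, because inside the new lambda
# `read` refers to the new lambda itself:
#   read = lambda: read()[:2]

# The "double lambda" binds the *current* value of `read` to a parameter,
# so the inner function calls the original:
read = (lambda inner: lambda: inner()[:2])(read)

print(read())  # prints ['a', 'b']
```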
You can't use lambda functions with pickle, so this doesn't look like it'll work in the distributed case, which shares objects via pickling. You have to make this a class method if you want to do this redirection. That should also make it much easier to understand, because you can just call `self._read` in that function, instead of passing functions around.
Same comment on the one below. You just need to add two class functions instead of this, and set `read_fn = self._read` in the default case and `read_fn = self._something_else` in the other cases. Though you might want to just use one function for the other cases, having it check `self.max_instances` on its own; the alternative again requires passing curried functions or lambdas around, which won't work with pickle.
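The pickling constraint is easy to demonstrate in isolation. A small sketch with an illustrative `Reader` class (not the actual dataset reader):

```python
import pickle

class Reader:
    """Illustrative stand-in for a dataset reader."""
    def _read(self, path):
        return f"reading {path}"

reader = Reader()

# A bound method pickles fine: pickle records the instance plus the
# method's name, and looks the method up again on load.
restored = pickle.loads(pickle.dumps(reader._read))
print(restored("train.txt"))  # prints: reading train.txt

# A lambda cannot be pickled, which is why wrapping self._read in a
# lambda breaks once the object has to cross a process boundary.
wrapped = lambda path: reader._read(path)
try:
    pickle.dumps(wrapped)
except (pickle.PicklingError, AttributeError) as err:
    print(f"lambda is not picklable: {type(err).__name__}")
```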
This does work. They don't need to be pickled as they are instantiated in the workers. But I'm not sure this is the most readable way of wrapping these functions.
Are you certain? Have you tried lazy + distributed together? Passing a lambda function to `LazyInstances` below was precisely the cause of a problem that we fixed recently: #4026.
Actually, I think you are right. This only worked in my experiments because I wasn't trying a lazy dataset.
50 lines of code later, this now also works with lazy datasets.
```diff
@@ -209,7 +230,12 @@ def read(self, file_path: str) -> Dataset:
         )

         # And finally we write to the cache if we need to.
-        if cache_file and not os.path.exists(cache_file):
+        if (
+            self.max_instances is None
```
This line made me pause and wonder if it was correct. I think I see why you did this (if you've set that, you're probably testing something, and don't want to cache only part of the data as if it's the whole thing), but it's not obvious at first glance, so a comment explaining why it's there would be nice.
An alternative to having this check here would be to move the caching logic above the place where you only keep `max_instances` instances (or move the slicing below this). I'd probably vote for that option instead of this. Well, that then might defeat the point of `max_instances`, because you'd be reading all of the data... I guess it depends on which flag you think takes precedence. There's a fair argument that `max_instances` should take precedence, and this check should stay as it is. In that case, noting in the docstring that `max_instances` disables saving data to the cache would be good.
I added this because I didn't want to write 500 instances to the cache, and then treat them as the whole dataset when reading the cache. I think the old code would do that.
There are two ways to avoid that. We could make the number of instances part of the filename of the cache, or we could never cache when `max_instances` is set. The latter seemed easier. It takes almost no time to read 500 instances anyway.
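The precedence being settled on here (a set `max_instances` disables writing the cache) can be sketched as a small predicate. The function name below is hypothetical, not the actual AllenNLP code:

```python
import os

def should_write_cache(cache_file, max_instances):
    """Hypothetical sketch of the guard discussed above: only write the
    cache when we read the *full* dataset, so a truncated run can never
    be mistaken for the whole dataset by a later cache read."""
    return (
        cache_file is not None
        and not os.path.exists(cache_file)
        and max_instances is None
    )

print(should_write_cache("/nonexistent/cache.jsonl", None))  # True: full read, no cache yet
print(should_write_cache("/nonexistent/cache.jsonl", 500))   # False: truncated, don't cache
print(should_write_cache(None, None))                        # False: caching disabled
```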
Yes, I agree, I ended up at the same place at the end of my comment. We should just add a comment and update the docstring to make this clear.
Done
LGTM when you think it's ready.
@matt-gardner, you already approved this, but I added quite a bit more. Do you want to give it another look?
Your solution with classes seems better to me than the extra functions I was suggesting. LGTM, with just a few minor comments.
```python
    In the case that you have an IterableDataset and you call len, the pytorch dataloader
    actually spits out a warning - but we need actually calling it to not crash.
    """
    return 1
```
I think we decided in another thread that this should crash, instead of returning 1, didn't we? But you're just moving this logic, not changing it, so if you want to leave that for a separate PR, that's fine with me.
Yes, that's what I was thinking. There is enough going on in this one as it is.
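The behavior being deferred here can be sketched without torch; `LazyInstances` below is a simplified stand-in for the real class, not the actual implementation:

```python
class LazyInstances:
    """Simplified stand-in for a lazy, iterable-style dataset wrapper."""

    def __init__(self, instance_generator):
        self.instance_generator = instance_generator

    def __iter__(self):
        return iter(self.instance_generator())

    def __len__(self):
        # PyTorch's DataLoader warns when len() is called on an
        # IterableDataset; returning a constant keeps len() from
        # crashing, at the cost of reporting a bogus size.
        return 1

dataset = LazyInstances(lambda: ["inst1", "inst2", "inst3"])
print(len(dataset))   # prints 1, regardless of the true size
print(list(dataset))  # the real instances still stream through
```

The thread above argues that crashing would be more honest than the bogus `1`; this sketch just shows the status quo being moved around.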
```diff
@@ -122,22 +122,29 @@ def __call__(
         epoch: int,
         batch_number: int,
         is_training: bool,
+        is_master: bool,
```
You don't need this, do you? It's already queryable from the `trainer` argument. You also didn't add it to the `EpochCallback`.
```diff
             epoch,
             batches_this_epoch,
             is_training=True,
+            is_master=self._master,
```
If the issue is the opaque private argument, I'd vote for adding a simple `is_master()` method to the trainer, instead of passing yet another flag here.
I'll be easily swayed one way or another in this matter, but my reasoning for having it this way was this: when you implement a `BatchCallback`, it's easy to forget about the multi-process case, but you almost certainly need to think about it. By making it a parameter, it becomes more visible and harder to ignore.
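The argument can be made concrete with a minimal sketch: with `is_master` in the signature, every implementer has to accept it. The classes below are simplified stand-ins for the real API:

```python
class BatchCallback:
    """Simplified stand-in for the trainer's batch callback interface."""

    def __call__(self, trainer, epoch, batch_number, is_training, is_master):
        pass

class CollectFromMaster(BatchCallback):
    """Example implementation that only records output in the master process,
    avoiding N duplicate lines under distributed training."""

    def __init__(self):
        self.lines = []

    def __call__(self, trainer, epoch, batch_number, is_training, is_master):
        # Having to accept is_master makes the multi-process case
        # hard to ignore when writing a callback.
        if is_master:
            self.lines.append(f"epoch {epoch}, batch {batch_number}")

callback = CollectFromMaster()
callback(None, 0, 1, True, is_master=True)   # master: recorded
callback(None, 0, 1, True, is_master=False)  # worker: ignored
print(callback.lines)  # prints ['epoch 0, batch 1']
```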
That's a good point that I hadn't thought of. Given that, I could also go either way here. Whatever you think is best. Is the epoch callback only ever called from master?
Yes, the epoch callback is only called from master. I thought that would make more sense, but on reflection, I'm not sure it's true. I'll make another PR that adds the same thing to the epoch callback.
Actually, that wasn't even true. The epoch callback was called all the time. Glad I checked!
```python
    pass


BatchCallback.register("null")(BatchCallback)
```
Just FYI, if that check in `FromParams` that we added this for bothers us in the future, we should probably just remove it. It's potentially brittle, and this is an easier solution.
Maybe, but in this case it's easy to write a `null` implementation. Not all classes are like that. But I guess we can always throw `NotImplementedError()`.
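A toy version of the pattern being discussed, registering the base class as its own "null" implementation; `register`/`by_name` here are simplified stand-ins for AllenNLP's `Registrable` machinery:

```python
class BatchCallback:
    """Base class that doubles as the do-nothing 'null' implementation."""
    _registry = {}

    @classmethod
    def register(cls, name):
        def add_to_registry(subclass):
            cls._registry[name] = subclass
            return subclass
        return add_to_registry

    @classmethod
    def by_name(cls, name):
        return cls._registry[name]

    def __call__(self, *args, **kwargs):
        pass  # null implementation: do nothing

# Registering the base class itself under "null" means a default,
# do-nothing callback can be constructed by name from configuration.
BatchCallback.register("null")(BatchCallback)

null_callback = BatchCallback.by_name("null")()
print(null_callback(batch=1))  # prints None: the call is a no-op
```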
tests/commands/train_test.py
```python
for metadata in batch["metadata"]:
    logger.info(f"First word from training data: '{metadata['words'][0]}'")


def in_worker(self, *args, **kwargs) -> None:
```
What's this doing? I don't see it called anywhere.
Sorry, that was a leftover from an earlier iteration. I removed it.