This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Datapipeline poc #130

Closed
wants to merge 7 commits into from

Conversation

@justusschock justusschock (Member) commented Feb 18, 2021

What does this PR do?

This is just some API prototype. So far it is not completely working. It is basically just meant as a discussion starter :)

Fixes #67

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks pep8speaks commented Feb 18, 2021

Hello @justusschock! Thanks for updating this PR.

Line 214:5: E266 too many leading '#' for block comment
Line 271:12: E713 test for membership should be 'not in'
Line 304:121: E501 line too long (124 > 120 characters)

Line 30:121: E501 line too long (138 > 120 characters)
Line 38:121: E501 line too long (132 > 120 characters)
Line 41:121: E501 line too long (205 > 120 characters)
Line 50:121: E501 line too long (205 > 120 characters)
Line 55:121: E501 line too long (140 > 120 characters)
Line 200:121: E501 line too long (160 > 120 characters)

Line 56:121: E501 line too long (140 > 120 characters)
Line 134:121: E501 line too long (174 > 120 characters)

Comment last updated at 2021-02-22 12:13:55 UTC

@codecov codecov bot commented Feb 18, 2021

Codecov Report

Merging #130 (31f65a5) into master (a6edeab) will decrease coverage by 81.27%.
The diff coverage is 9.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #130       +/-   ##
==========================================
- Coverage   87.39%   6.12%   -81.28%     
==========================================
  Files          49      51        +2     
  Lines        1579    1846      +267     
==========================================
- Hits         1380     113     -1267     
- Misses        199    1733     +1534     
Flag Coverage Δ
unittests 6.12% <9.67%> (-81.28%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
flash/core/model.py 6.36% <3.94%> (-89.33%) ⬇️
flash/data/postprocessing_pipeline.py 6.41% <6.41%> (ø)
flash/data/data_pipeline.py 15.20% <15.20%> (ø)
flash/text/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/vision/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/text/seq2seq/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/vision/detection/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/vision/embedding/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/text/seq2seq/core/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
flash/vision/classification/model.py 0.00% <0.00%> (-100.00%) ⬇️
... and 39 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6edeab...17cecb8.

@property
def postprocessing_pipeline(self) -> PostProcessingPipeline:
return self._get_pipeline('postprocessing')

Member Author:

TODO: Missing setter
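
A minimal sketch of what the missing setter could look like (the backing attribute name here is an assumption, not part of this diff):

    # Sketch only; the _postprocessing_pipeline attribute name is an assumption.
    @postprocessing_pipeline.setter
    def postprocessing_pipeline(self, postprocessing_pipeline: PostProcessingPipeline) -> None:
        self._postprocessing_pipeline = postprocessing_pipeline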

"""Pipeline to use when there is no datamodule or it has not defined its pipeline"""
return DataModule.default_pipeline()
return DataModule.default_data_pipeline()
Member Author:

TODO: also do this for postprocessing

@@ -188,3 +210,111 @@ def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:

def configure_finetune_callback(self):
return []

### THE FOLLOWING IS A POC FOR DISTRIBUTED PREDICTION
def on_predict_start(self):
Member Author:

@tchaton does it make sense to have a hook like this? (I think we need to revisit the Lightning hooks in general for all stages.)

Contributor:

Where is it called? I guess we could add a hook for predict. Needs a bit more exploration there.

Member Author:

Similar to on_fit_start, it would be called immediately after Trainer.predict is called.
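
A rough sketch of the intended ordering (pseudo-flow only, not actual Lightning code):

    # Pseudo-flow sketch of where the proposed hook would fire relative to Trainer.predict.
    def _predict_flow(model, dataloader):
        model.on_predict_start()                  # e.g. attach the postprocessing pipeline
        for batch_idx, batch in enumerate(dataloader):
            model.predict_step(batch, batch_idx)
        # results would be gathered/uncollated afterwards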

self.postprocessing_pipeline._attach_to_model(self)

def predict_step(self, batch, batch_idx):
# TODO: Move lightning predict loop from predict to predict_step
Member Author:

@tchaton You mentioned the prediction API is not final in lightning, right?

IMO it makes sense to rename it to predict_step within the LightningModule, since (similar to training_step etc.) it only runs prediction for one batch at a time, making it more of a step (plus we can use the predict keyword here independently :) )

Contributor:

Yes, I am good with that :)

Contributor:

That was my initial proposal, but it got turned down because users are expected to write

model.predict(...)

but not

model.predict_step()

because nobody calls

model.training_step
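
For context, the distinction being debated, as an illustrative sketch (not actual Flash/Lightning code):

    # Illustrative only: users call the high-level entry point ...
    predictions = model.predict(samples)
    # ... while a *_step method is only ever invoked per batch by the loop itself,
    # the same way nobody calls model.training_step(batch, batch_idx) by hand.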

Comment on lines +264 to +267
# TODO: Also use these for trainer creation in training?
# TODO: Have default trainer kwargs per task?
_trainer_kwargs = {}
# TODO: Adjust this to trainer running stage from pl
Member Author:

Any thoughts on that, @tchaton @aribornstein?

Contributor:

I like it. We had a similar function in a previous iteration of predict.

Member Author:

I know we had something like that for training in the beginning. The only downside I see is that it hides away the Lightning Trainer.

Contributor:

Maybe we could provide an optional argument for the user to pass a trainer in case they don't want to use the default one?

Member Author:

You mean another trainer class?
Actually, the trainer class is something I'd hardcode here, tbh.
This is one of the very fundamental Lightning aspects, and I feel that if users want to change it, they should either look into customization with callbacks/plugins or subclass the task to override it here directly.
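
A minimal sketch of that customization path (the base class and the _create_trainer hook are hypothetical, purely for illustration):

    # Hypothetical sketch: customize trainer creation by subclassing the task;
    # SomeFlashTask and _create_trainer are assumptions, not Flash API.
    from pytorch_lightning import Trainer

    class MyTask(SomeFlashTask):
        def _create_trainer(self, **trainer_kwargs):
            trainer_kwargs.setdefault("precision", 16)
            return Trainer(**trainer_kwargs)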


return AutoDataset(data=data, load_fn=load_fn, load_per_sample=load_per_sample)

def _generate_loader(
Member Author:

Should we make this public API? @tchaton

Contributor:

Yes, and it should be to_dataloader.
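
Hypothetical usage once it is public (the exact signature is an assumption):

    # Hypothetical usage sketch; the signature is an assumption.
    loader = data_pipeline.to_dataloader(filenames, batch_size=32, num_workers=0)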

)
return model

def _generate_auto_dset(self, data: Union[Iterable, Any]) -> AutoDataset:
Member Author:

Should we make this public API? @tchaton

Contributor:

I think to_dataloader is enough.

@tchaton tchaton (Contributor) left a comment:

Looks great overall. I would have to convert some of the current pipelines to this new API and see how it feels.

return self._get_pipeline('postprocessing')

def _get_pipeline(self, pipeline_type: str):
pipeline_attr_name = f'{pipeline_type}_pipline'
Contributor:

_pipline typo?

@staticmethod
def default_pipeline() -> DataPipeline:
def default_data_pipeline() -> DataPipeline:
Contributor:

I think here we should take the data-type-specific default? For example, the collate for text isn't the same as for vision.

Member Author:

Yeah, but that's why each task would have its own default.


if self.trainer is not None and hasattr(self.trainer, 'datamodule') and self.trainer.datamodule is not None:
if hasattr(self.trainer.datamodule,
pipeline_attr_name) and getattr(self.trainer.datamodule, pipeline_attr_name is not None):
Contributor:

When can pipeline_attr_name be None?

Member Author:

It can't; that should be outside the brackets :)
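
i.e. the corrected check would presumably read:

    if hasattr(self.trainer.datamodule, pipeline_attr_name) \
            and getattr(self.trainer.datamodule, pipeline_attr_name) is not None: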

Comment on lines +82 to +92
        elif post_collate_overriden:
            worker_collate = collate_fn
            device_collate = self._do_nothing_collate

        elif device_pre_collate_overriden:
            worker_collate = self._do_nothing_collate
            device_collate = collate_fn

        else:
            worker_collate = collate_fn
            device_collate = self._do_nothing_collate
Contributor:

Suggested change

    # current
    elif post_collate_overriden:
        worker_collate = collate_fn
        device_collate = self._do_nothing_collate
    elif device_pre_collate_overriden:
        worker_collate = self._do_nothing_collate
        device_collate = collate_fn
    else:
        worker_collate = collate_fn
        device_collate = self._do_nothing_collate

    # suggested
    if device_pre_collate_overriden:
        worker_collate = self._do_nothing_collate
        device_collate = collate_fn
    else:
        worker_collate = collate_fn
        device_collate = self._do_nothing_collate

was_seq = False

for idx, loader in enumerate(dataloader):
if isinstance(loader, DataLoader):
Contributor:

This won't work with custom dataloaders. See data_loading.py in Lightning.

Member Author:

That's why it's guarded like this. But IMO we shouldn't expect any custom loaders here, since then people would just be using Lightning. Also, you cannot patch this for custom loaders without knowing their internals.

@aribornstein aribornstein (Contributor) commented Feb 20, 2021:

I don't agree that we shouldn't expect custom data loaders; I just ran into a huge issue with this today. If I want to extend Flash's capabilities, I shouldn't have to implement a Lightning datamodule from scratch to take advantage of Lightning's features. I should be able to extend Lightning as needed.

Member Author:

Yes, but what kind of interface do you want to assume in that case? E.g. if you have a custom loader class, there might not even be something we can attach to...


setattr(model, loader_name, dataloader)

model.transfer_batch_to_device = (
Contributor:

Love this! Pretty smart!

if auto_collate:
loader_kwargs['collate_fn'] = default_collate
else:
loader_kwargs['collate_fn'] = default_convert
Contributor:

Isn't default_convert used only for numpy arrays?

Member Author:

No, default_convert also converts numbers etc.; it basically does the same as default_collate, just without the tensor stacking.
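
A quick illustration of the difference (minimal sketch; recent PyTorch versions expose both helpers under torch.utils.data, and the expected outputs are shown as comments):

    import numpy as np
    from torch.utils.data import default_collate, default_convert

    batch = [np.array([0.0, 1.0]), np.array([2.0, 3.0])]
    default_convert(batch)  # [tensor([0., 1.]), tensor([2., 3.])] -- elements converted, not stacked
    default_collate(batch)  # tensor([[0., 1.], [2., 3.]])         -- converted and stacked into one tensor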

except AttributeError:
self._data_pipeline = self.default_pipeline()
return self._data_pipeline
return self._get_pipeline('data')

@data_pipeline.setter
def data_pipeline(self, data_pipeline: DataPipeline) -> None:
Contributor:

I find it a bit confusing to have DataPipeline and PostProcessingPipeline, as people might expect a PreprocessingPipeline. Worth iterating on this one.

Member Author:

Yes, I thought so as well. Basically I named it data_pipeline since it does loading + preprocessing. But I'm fine with changing it as well.

Contributor:

It also does postprocessing, with after_uncollate

@t-vi t-vi (Contributor) commented Feb 20, 2021

@tchaton pointed me here. I must admit I don't feel that I've figured out the design behind the code.

If I may make two observations despite my limited understanding: there are two places where I have the impression that the API you are creating here is not aligned with how I think about my data and my models and how they meet:

  • I don't think all preprocessing should be tied to the dataloader (/collate fn). To me, "same datamodule, different augmentation" or "I have an image from somewhere (e.g. the webcam) but want some part of the preprocessing" can happen, and it seems unnatural to have to modify the datamodule for this or to apply parts manually.
  • My impression is that the approach here for dealing with what is passed in to new_predict etc. is to set self.something and then use self as state. As a user this isn't what I'd expect (I would expect the state to be fixed, and that I might override parts of it through the args for this one call).

Part of this might not be solvable within flash itself but might need amending lightning (in particular, there seems to be no "auxiliary information" going into the train loop/step except the dataloader).

@justusschock justusschock (Member Author):

@t-vi Thanks for your comments, they are very valuable.

Regarding your first point:

  • Why not tie everything to the loader? Creating a loader (especially one with num_workers=0) is almost no overhead.
  • In terms of augmentations: you can, for example, have them as an init argument for your pipeline (see the sketch after this list).
  • To the datamodule part: You're right and the integration with the datamodule is definitely not perfect. This should just be a base to iterate on :)
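
A minimal sketch of the init-argument idea from the first bullet (the pipeline class and argument names are hypothetical, not from this PR):

    # Hypothetical sketch: augmentations handed to the pipeline at construction time.
    from torchvision import transforms

    train_pipeline = MyImageDataPipeline(
        augmentations=transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])
    )
    predict_pipeline = MyImageDataPipeline(augmentations=transforms.ToTensor())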

Regarding your second point:
You're right to some extent (and the part where you may not be right is not visible here, so let me explain my thoughts behind that). Yes, I kind of wanted to use the model as state, but only temporarily. The part that's still missing here (but should definitely come) is reverting back to the original state.

What kind of API would you expect as a user? Maybe we can look at how we can integrate this kind of API into flash/lightning :)

@carmocca carmocca (Contributor) left a comment:

I must be missing lots of context here, because this goes in a completely different direction from what we initially discussed for Flash, right?

We agreed on using model.predict for simple inference and trainer.predict for distributed inference, right? Have there been further developments on this? What's the user-facing API with these changes?

Also, what is PostProcessingPipeline? Why is it separate from DataPipeline?

def predict_step(self, batch, batch_idx):
# TODO: Move lightning predict loop from predict to predict_step
if isinstance(batch, (tuple, list)) and len(batch) == 2:
x, y = batch

@carmocca carmocca mentioned this pull request Feb 24, 2021
@justusschock justusschock deleted the datapipeline_poc branch March 10, 2021 15:23
@Borda Borda mentioned this pull request Mar 10, 2021
Development

Successfully merging this pull request may close these issues.

Improve DataPipeline API
6 participants