This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

RFC for Keras Preprocessing redesign #6

Closed
wants to merge 8 commits

Conversation


@fchollet fchollet commented Jul 30, 2019

Keras-Preprocessing Redesign

Comment period open until August 18th, 2019.

Status Proposed
Author(s) Francois Chollet (fchollet@google.com)
Updated 2019-07-29

Context

tf.data.Dataset is the main API for data loading and preprocessing in TensorFlow. It has two advantages:

  • It supports GPU prefetching
  • It supports distribution via the Distribution Strategies API

Meanwhile, keras.preprocessing is a major API for data loading and preprocessing in Keras. It is based
on Numpy and Scipy, and it produces instances of the keras.utils.Sequence class, which are finite-length,
resettable Python generators that yield batches of data.
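For reference, the Sequence contract (finite length, index-addressable batches) can be sketched as a plain-Python stand-in. `ArraySequence` below is a hypothetical illustration of the interface, not the actual keras.utils.Sequence class:

```python
import math
import numpy as np

class ArraySequence:
    """Minimal stand-in for the keras.utils.Sequence interface:
    a finite, resettable, index-addressable batch iterator."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out batch number `idx`.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]

x = np.zeros((10, 4))
y = np.zeros((10,))
seq = ArraySequence(x, y, batch_size=4)
```

Because `__len__` and `__getitem__` fully define the epoch, such an object can be iterated repeatedly (reset for free) and shuffled by permuting batch indices.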

Some features of keras.preprocessing are highly useful and don't have straightforward equivalents in tf.data
(in particular image data augmentation and dynamic time series iteration).

Ideally, the utilities in keras.preprocessing should be made compatible with tf.data.
This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.

Some features are also being supplanted by preprocessing layers, in particular text processing.
As a result we may want to deprecate them.

Goals

  • Make all features of keras.preprocessing compatible with tf.data
  • As a by-product, add required ops to TensorFlow
  • Improve the ergonomics and user-friendliness of the keras.preprocessing APIs

@fchollet fchollet changed the title Add RFC for Keras Preprocessing redesign RFC for Keras Preprocessing redesign Aug 1, 2019
@minimaxir

Another potential use case for this redesign is TensorFlow BigQuery ML, which as of now requires input transformations/datatypes to be TF native.

Only core TensorFlow operations are supported: models that use custom or tf.contrib operations are not supported.


AakashKumarNain commented Aug 2, 2019

Data augmentation happens on actual images, before normalization (this is more correct, and necessary in order to visualize the effect of augmentation).

This was very much needed. Even today, once we apply augmentation, we have to write sample code to make sure everything is working as expected. That is simple enough to do, but it makes sense for it to be part of the API itself.
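A quick way to sanity-check augmentation is to tile several augmented copies of one image side by side and inspect (or save) the result. This is an illustrative sketch only; `preview_augmentation` and the toy flip augmenter are hypothetical, not part of the proposed API:

```python
import numpy as np

def preview_augmentation(image, augment_fn, n=4):
    """Apply `augment_fn` n times to one image and tile the results
    along the width so the effect can be eyeballed or saved to disk."""
    samples = [augment_fn(image) for _ in range(n)]
    return np.concatenate(samples, axis=1)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))

# Toy augmenter: random horizontal flip.
flip = lambda im: im[:, ::-1] if rng.random() < 0.5 else im

grid = preview_augmentation(img, flip, n=4)
```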


@Dref360 Dref360 left a comment


Awesome!
I would be able to work on some of the features once this PR is merged.

- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets

**Proposal:** (TBD)

My only concern is that this proposal is not general enough to accommodate all users. If a user wants to do object detection, for example, they will not use keras-preprocessing.

I can work on a new RFC, but I feel that generator/dataset_from_dataframe could be general enough to support it.

Contributor Author

Have you checked out the proposal for segmentation? Could we use something similar for object detection? What's missing?


Right now the segmentation example works because the targets are images. In a case where we have boxes or keypoints, we don't have that. Furthermore, to do data augmentation on those, we need to normalize the points between 0 and 1, or provide the shape of the original image.

Contributor Author

What would be your preferred API for object detection?


We would need something similar to flow_from_dataframe, and a way to request some info from the image, like its shape.
We would also need the ability to change the way we augment 'samples' (image, boxes, keypoints).
Then we could support any task with little effort, I think.

As I said, we should work on another RFC, as this feature is out of scope for this one.

Maybe I should bring it up during the meeting on Monday?

Contributor Author

Let's discuss this today.


Proposed changes: API simplification & standardization, and extension of supported types.

- All functions should work on either single images or batches of images

Do we use the deprecation procedure for this?

Contributor Author

@fchollet fchollet Aug 5, 2019

As we rename arguments, we should keep the old names working (with a deprecation warning).

There is one case where we are changing the semantics of an argument in a non-backwards compatible way: the preprocessing_function would now be applied before the transformations rather than after. We could come up with new names in order to avoid breakages. Thoughts?


bhack commented Aug 3, 2019

There was a 3d preprocessing request/thread at #5.
3D preprocessing was also requested at the Keras-SIG meeting.

# Returns JSON-serializable config dict

def preview(self, data, save_to_directory=None, save_prefix=None, save_format='png'):
"""Enables users to previous the image augmentation configuration.


Enables users to preview the image augmentation configuration.

Contributor Author

Fixed

@PhilipMay

A comment about resizing: when you specify resize, the image will be distorted if the aspect ratios of the original and the resized image do not match. This is something you might not want. I think it would be a nice feature to fill the image with a border to avoid distortion and preserve the aspect ratio. It should be possible to specify the fill mode, e.g. via a fill_mode argument.

I am not sure if this is the right place to propose this as it might be out of scope of this RFC.
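The border-fill idea could be sketched as follows. `pad_to_aspect` is a hypothetical helper that pads a NumPy image with a constant border until it matches the target aspect ratio; the actual resize (interpolation) would then happen on the padded result without distortion:

```python
import numpy as np

def pad_to_aspect(image, target_h, target_w, fill_value=0.0):
    """Pad an (h, w, c) image symmetrically with `fill_value` so its
    aspect ratio matches target_h:target_w, preserving all content."""
    h, w = image.shape[:2]
    if w * target_h < h * target_w:
        # Image is too narrow for the target aspect: pad the width.
        new_w = int(round(h * target_w / target_h))
        pad = new_w - w
        widths = ((0, 0), (pad // 2, pad - pad // 2), (0, 0))
    else:
        # Image is too short: pad the height.
        new_h = int(round(w * target_h / target_w))
        pad = new_h - h
        widths = ((pad // 2, pad - pad // 2), (0, 0), (0, 0))
    return np.pad(image, widths, constant_values=fill_value)

# A 100x50 portrait image padded for a square 64x64 target
# becomes 100x100; resizing it to 64x64 then keeps the aspect ratio.
padded = pad_to_aspect(np.zeros((100, 50, 3)), 64, 64)
```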

@ekaakurniawan

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?


fchollet commented Aug 5, 2019

A comment about resizing: when you specify resize, the image will be distorted if the aspect ratios of the original and the resized image do not match. This is something you might not want. I think it would be a nice feature to fill the image with a border to avoid distortion and preserve the aspect ratio. It should be possible to specify the fill mode, e.g. via a fill_mode argument.

I am not sure if this is the right place to propose this as it might be out of scope of this RFC.

The resize argument is mostly meant to cover the cases where you don't care very much about these details.

If you do care about aspect ratio, or if you want to use a different kind of resizing strategy (e.g. center crop and resize, which is popular), I would suggest implementing it in preprocessing_function, which should be applied on single images before any resizing.
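As an illustration, the kind of custom logic one could pass as `preprocessing_function` might look like the following center-crop step (the helper name is hypothetical, and the generator would apply it per image before its own resizing):

```python
import numpy as np

def center_crop(image):
    """Crop the largest centered square from an (h, w, c) image;
    a sketch of custom logic suitable for `preprocessing_function`."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]

img = np.zeros((100, 60, 3))
cropped = center_crop(img)
```

The popular "center crop and resize" strategy is then just this function followed by the generator's built-in resize.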


fchollet commented Aug 5, 2019

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?

We want the image transformations to support both numpy arrays and tensors, so we would probably have both tf.image and scipy-based implementations.
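Such a dual-backend function could be structured by dispatching on the input type. This is a sketch under assumptions, not the planned implementation; note that on the tensor path tf.image.flip_left_right expects a 3-D or 4-D tensor:

```python
import numpy as np

def flip_horizontal(image):
    """Flip an image left-right, accepting either a NumPy array or a
    TensorFlow tensor. TensorFlow is only imported on the tensor path,
    so the NumPy path works without TF installed."""
    if isinstance(image, np.ndarray):
        return image[:, ::-1]
    import tensorflow as tf  # assumed tf.Tensor input (3-D or 4-D)
    return tf.image.flip_left_right(image)
```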


AakashKumarNain commented Aug 5, 2019

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?

We want the image transformations to support both numpy arrays and tensors, so we would probably have both tf.image and scipy-based implementations.

The only problem is that tf.image doesn't yet cover major functionality provided by OpenCV and PIL. The implementations also differ, and while all users use OpenCV or PIL, not all of them use tf.image.

#### Question: how to support image segmentation in a simple way?

**Requirements:**
- Image loading and image augmentation should be synced across inputs and targets


Keeping things synced is simple. Once the user passes the inputs and targets, we can store the input filename and target filename in a pandas dataframe.

         inputfile            targetfile
0      path_to_file1     path_to_target1
1      path_to_file2     path_to_target2
...       ...                      ...

  1. To shuffle the data while keeping inputs and targets synced, we only have to shuffle the indices of this dataframe.
  2. Any augmentation applied to a particular input will be applied to the target as well, because they can now be processed sequentially by the same thread.
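The index-shuffling idea can be shown even without pandas: draw one permutation and apply it to both columns, so input/target pairs stay aligned. The file names below are made up for illustration:

```python
import numpy as np

input_files = ["img_0.png", "img_1.png", "img_2.png"]
target_files = ["mask_0.png", "mask_1.png", "mask_2.png"]

# One shared permutation keeps the i-th input paired with the
# i-th target, exactly like shuffling the dataframe's index.
rng = np.random.default_rng(42)
order = rng.permutation(len(input_files))
inputs = [input_files[i] for i in order]
targets = [target_files[i] for i in order]
```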


It's almost what flow_from_dataframe already does. So maybe we can beef up this function to perform whatever we need.


**Requirements:**
- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets


Wouldn't preprocessing_fn be enough for that?

vertical_flip=False,
rescale=None,
preprocessing_function=None,
postprocessing_function=None,
@PhilipMay PhilipMay Aug 5, 2019

There is some confusion between preprocessing_function and postprocessing_function here. The current implementation does not contain postprocessing_function.

At line 194 below (in the # PROPOSED section) it is missing.

Contributor Author

Fixed it.


fchollet commented Aug 14, 2019

We will do the design review for this RFC on Monday 19th August at 2pm PT. I am leaving the comment period open until then.

@fchollet
Contributor Author

Working list of unresolved questions:

  • Is the proposed workflow for segmentation acceptable?
  • What can we do about object detection workflows?
  • Should support for 3D image data be added?
  • What new ops should be added to tf.image to support this project?


PhilipMay commented Aug 14, 2019

Working list of unresolved questions:
* Is the proposed workflow for segmentation acceptable?
* What can we do about object detection workflows?
* Should support for 3D image data be added?
* What new ops should be added to tf.image to support this project?

From my point of view the RFC still has a minor bug. It says that postprocessing_function is implemented (in CURRENT), but it should be placed in the PROPOSED signature.

See above: #6 (review)

@PhilipMay

One more comment:
When you implement two parameters named preprocessing_function and postprocessing_function, the postprocessing_function is new, but the semantics of preprocessing_function will change.

Let me explain:

Currently, preprocessing_function is basically a postprocessing function: it happens after the rest of the augmentation has been done.

When you implement a postprocessing_function in the future, it will get the semantics of the old preprocessing_function. This could lead to confusion and bugs.

I suggest renaming both parameters, so that the old parameter is no longer available.

This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.

Some features are also being subplanted by [preprocessing layers](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md), in particular text processing.

nit: supplanted

- Make all classes in `keras.preprocessing` JSON-serializable.
- Deprecate `Tokenizer` class in favor of `TextVectorization` preprocessing layer
- Make image-transformation functions in `affine_transformations` submodule compatible with TensorFlow tensor inputs (accepting tensors and returning tensors).
- Improve signature of the above-mentioned functions, by using fully-spelled argument names, and fewer arguments if possible.

How are we approaching backwards compatibility here? Presumably this pertains to keras-team/keras initially, but we will pull into TF, at which point we will need to maintain all the old APIs. What's the plan?


In Keras we have a deprecation procedure in our wiki.
We also have utils in Keras to emit the correct warnings.
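Such a renaming-with-warning utility can be sketched as a decorator. This is an illustrative stand-in, not Keras's actual deprecation helper; the argument names (`wrap_mode`/`fill_mode`) are made up:

```python
import functools
import warnings

def deprecated_arg(old, new):
    """Decorator that keeps an old keyword argument working under its
    new name while emitting a DeprecationWarning."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if old in kwargs:
                warnings.warn(
                    f"`{old}` is deprecated, use `{new}` instead.",
                    DeprecationWarning, stacklevel=2)
                kwargs[new] = kwargs.pop(old)
            return fn(*args, **kwargs)
        return inner
    return wrap

@deprecated_arg(old="wrap_mode", new="fill_mode")
def transform(image, fill_mode="nearest"):
    # Toy body: just report which fill mode was received.
    return fill_mode

# Calling with the old name still works, with a DeprecationWarning.
result = transform(None, wrap_mode="reflect")
```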

```python
# PROPOSED

def array_to_generator(
```

There seems to be some variation in from/to -- should these be standardized?

Contributor Author

@fchollet fchollet Aug 19, 2019

Yes, let's standardize them. We can settle on either (input)_to_(output) or (output)_from_(input).

seed=None,
subset=None):

def dataset_from_dataframe(

Is there a world where we don't need all of these, and we can just determine data type from the passed object? String is dir path, array/dataframe can be type-checked. Then we just have to_dataset(), without the need for so many separate APIs.

Contributor Author

That would be a simplification in terms of the set of methods available, but the signatures would get quite complex / confusing since each action has its own specific keyword arguments on top of the shared ones (e.g. follow_links, target_mode...)

preprocessing_function=None,
interpolation_order=1,
data_format='channels_last',
dtype='float32'):

This is a lot of constructor args, and it biases us towards adding more with each new feature, rather than finding other ways of setting parameters. Is this the best way to do this, versus having subclasses that handle subsets, or deferring collecting some of these until the method in which they are needed?


I guess we could decompose this into two parts: the normalization and the augmentation; the rest is shared. But it may add some complexity to what the user needs to do.

Contributor Author

There is one context in which having lots of arguments is acceptable: when they form a flat, list-like configuration, i.e. adding more arguments results in linear complexity and maintenance workload, because they don't interact with each other. This is the case here: we're just describing a list of image transformations. Once you have the logic infrastructure to do one image transformation, doing arbitrarily many does not add much.


`tf.data.Dataset` is the main API for data loading and preprocessing in TensorFlow. It has two advantages:

- It supports GPU prefetching

"It supports GPU prefetching" is a small subset of the following advantages:

  • it supports parallelization of transformations
  • it supports parallelization of I/O
  • it supports software pipelining (e.g. prefetching to GPU / TPU)
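The software-pipelining point can be illustrated outside tf.data with a pure-Python prefetcher: a background thread fills a bounded buffer so data preparation overlaps with consumption, analogous in spirit to Dataset.prefetch (a sketch only, not how tf.data is implemented):

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread keeps a
    bounded queue filled, overlapping production with consumption."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item
```

While the consumer (e.g. a training step) processes item i, the producer is already preparing item i+1.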


I think the first two advantages you mentioned are not listed because they are already supported by the current API.

- Remove the submodules `image`, `text`, `timeseries` from the public API and expose their contents as part of the top-level `keras.preprocessing`.
- Make all classes in `keras.preprocessing` JSON-serializable.
- Deprecate `Tokenizer` class in favor of `TextVectorization` preprocessing layer
- Make image-transformation functions in `affine_transformations` submodule compatible with TensorFlow tensor inputs (accepting tensors and returning tensors).


I assume this would necessitate new TF ops -- would we expose those in tf.image? I'd like to make that API more complete.

Contributor

There are also some ops from Deepmind at https://github.com/deepmind/multidim-image-augmentation

- Rename methods `flow` to `generator_from_arrays`, `flow_from_directory` to `generator_from_directory`, and `flow_from_dataframe` to `generator_from_dataframe`.
- Improve signature of the above-mentioned methods.
- Figure out how to support image segmentation use cases with `ImageDataGenerator` (open question).
- Refactor `TimeseriesGenerator` to follow a similar design as `ImageDataGenerator`. Only configuration arguments should be passed in the constructor. The data should be passed to methods such as `dataset_from_arrays`.


Very broad question: Is it worth having this two level structure (IIUC): A Generator class configured with a few options, which then offers methods to create generators or Datasets? Would it be easier to have a flat structure of factory methods (or classes) which simply take configuration and data and produce an appropriate Dataset / generator / array?

In other words, what does the separation of data and config buy us? Do people reuse a (say) ImageDataGenerator several times to produce different data using the same configured preprocessing?

@Dref360 Dref360 Aug 15, 2019

I don't think people reuse their ImageDataGenerator, but both the ImageDataGenerator constructor and dataset_from_dataframe require many arguments, which might confuse the user if combined into a single function.

Contributor Author

The factoring between config and data is there for 2 reasons:

  • to enable reusing a config on new data; this is how we plan to support image segmentation, for instance (target masks have to be augmented in the same way as inputs)
  • to make it easier for people to think about data augmentation/normalization and loading separately, in two steps, which would otherwise form a very large and potentially confusing single step.
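The segmentation pattern enabled by this config/data split amounts to sampling transform parameters once and applying the identical transform to both the image and its mask. A NumPy sketch, with made-up helper names:

```python
import numpy as np

def sample_transform(rng):
    """Draw augmentation parameters once, so the identical transform
    can be applied to an input image and its segmentation mask."""
    return {"flip": rng.random() < 0.5, "k_rot": int(rng.integers(4))}

def apply_transform(image, params):
    """Apply the sampled flip/rotation to any array of matching shape."""
    out = image[:, ::-1] if params["flip"] else image
    return np.rot90(out, k=params["k_rot"])

rng = np.random.default_rng(0)
params = sample_transform(rng)

image = np.arange(16).reshape(4, 4)
mask = (image > 7).astype(int)

# Same params => pixels of image and mask stay aligned.
aug_image = apply_transform(image, params)
aug_mask = apply_transform(mask, params)
```

Because both arrays go through the same pixel permutation, thresholding the augmented image reproduces the augmented mask exactly.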

@fchollet
Contributor Author

Following the design review meeting on Monday, we have decided to take more time to significantly rewrite this proposal. I'll be closing the PR for the time being, and we will open a new PR with the revised proposal at a later date.
