This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

RFC for Keras Preprocessing redesign #6

Closed
wants to merge 8 commits

Conversation


@fchollet fchollet commented Jul 30, 2019

Keras-Preprocessing Redesign

Comment period open until August 18th, 2019.

Status Proposed
Author(s) Francois Chollet (fchollet@google.com)
Updated 2019-07-29

Context

tf.data.Dataset is the main API for data loading and preprocessing in TensorFlow. It has two advantages:

  • It supports GPU prefetching
  • It supports distribution via the Distribution Strategies API

Meanwhile, keras.preprocessing is a major API for data loading and preprocessing in Keras. It is based
on Numpy and Scipy, and it produces instances of the keras.utils.Sequence class, which are finite-length,
resettable Python generators that yield batches of data.
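For reference, the Sequence contract (finite length, index-addressable batches) can be sketched as a plain-Python stand-in. `ArraySequence` below is a hypothetical illustration of the interface, not the actual keras.utils.Sequence class:

```python
import math
import numpy as np

class ArraySequence:
    """Minimal stand-in for the keras.utils.Sequence interface:
    a finite, resettable, index-addressable batch iterator."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out batch number `idx`.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]

x = np.zeros((10, 4))
y = np.zeros((10,))
seq = ArraySequence(x, y, batch_size=4)
```

Because `__len__` and `__getitem__` fully define the epoch, such an object can be iterated repeatedly (reset for free) and shuffled by permuting batch indices.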

Some features of keras.preprocessing are highly useful and don't have straightforward equivalents in tf.data
(in particular image data augmentation and dynamic time series iteration).

Ideally, the utilities in keras.preprocessing should be made compatible with tf.data.
This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.

Some features are also being supplanted by preprocessing layers, in particular text processing.
As a result we may want to deprecate them.

Goals

  • Make all features of keras.preprocessing compatible with tf.data
  • As a by-product, add required ops to TensorFlow
  • Improve the ergonomics and user-friendliness of the keras.preprocessing APIs

@fchollet fchollet changed the title Add RFC for Keras Preprocessing redesign RFC for Keras Preprocessing redesign Aug 1, 2019
@minimaxir

Another potential use case for this redesign is TensorFlow BigQuery ML, which as of now requires input transformations/datatypes to be TF native.

Only core TensorFlow operations are supported: models that use custom or tf.contrib operations are not supported.


AakashKumarNain commented Aug 2, 2019

Data augmentation happens on actual images, before normalization (this is more correct, and necessary in order to visualize the effect of augmentation).

This was very much needed. Even today, once we apply augmentation, we have to write sample code to make sure everything is working as expected. That is simple enough to do, but it makes sense for it to be part of the API itself.
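A quick way to sanity-check augmentation is to tile several augmented copies of one image side by side and inspect (or save) the result. This is an illustrative sketch only; `preview_augmentation` and the toy flip augmenter are hypothetical, not part of the proposed API:

```python
import numpy as np

def preview_augmentation(image, augment_fn, n=4):
    """Apply `augment_fn` n times to one image and tile the results
    along the width so the effect can be eyeballed or saved to disk."""
    samples = [augment_fn(image) for _ in range(n)]
    return np.concatenate(samples, axis=1)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))

# Toy augmenter: random horizontal flip.
flip = lambda im: im[:, ::-1] if rng.random() < 0.5 else im

grid = preview_augmentation(img, flip, n=4)
```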


@Dref360 Dref360 left a comment


Awesome!
I would be able to work on some of the features once this PR is merged.

- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets

**Proposal:** (TBD)

My only concern is that this proposal is not general enough to accommodate all users. If a user wants to do object detection, for example, they will not use keras-preprocessing.

I can work on a new RFC, but I feel that generator/dataset_from_dataframe could be general enough to support it.

Contributor Author

Have you checked out the proposal for segmentation? Could we use something similar for object detection? What's missing?


Right now the segmentation example works because the targets are images. In a case where we have boxes or keypoints, we don't have that. Furthermore, to do data augmentation on those, we need to normalize the points between 0 and 1, or provide the shape of the original image.

Contributor Author

What would be your preferred API for object detection?


We would need something similar to flow_from_dataframe, and a way to request some info from the image, like its shape.
We would also need the ability to change the way we augment 'samples' (image, boxes, keypoints).
Then we could support any task with little effort, I think.

As I said, we should work on another RFC, as this feature is out of scope for this one.

Maybe I should bring it up during the meeting on Monday?

Contributor Author

Let's discuss this today.


Proposed changes: API simplification & standardization, and extension of supported types.

- All functions should work on either single images or batches of images

Do we use the deprecation procedure for this?

Contributor Author

@fchollet fchollet Aug 5, 2019

As we rename arguments, we should keep the old names working (with a deprecation warning).

There is one case where we are changing the semantics of an argument in a non-backwards compatible way: the preprocessing_function would now be applied before the transformations rather than after. We could come up with new names in order to avoid breakages. Thoughts?


bhack commented Aug 3, 2019

There was a 3d preprocessing request/thread at #5.
3D preprocessing was also requested at the Keras-SIG meeting.

# Returns JSON-serializable config dict

def preview(self, data, save_to_directory=None, save_prefix=None, save_format='png'):
"""Enables users to previous the image augmentation configuration.


Enables users to preview the image augmentation configuration.

Contributor Author

Fixed

@PhilipMay

A comment about resizing: when you specify resize, the image will be distorted if the aspect ratios of the original and the resized image do not match. This is something you might not want. I think it would be a nice feature to fill the image with a border to avoid distortion and preserve the aspect ratio. It should be possible to specify the fill mode, e.g. via a fill_mode argument.

I am not sure if this is the right place to propose this as it might be out of scope of this RFC.
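The border-fill idea could be sketched as follows. `pad_to_aspect` is a hypothetical helper that pads a NumPy image with a constant border until it matches the target aspect ratio; the actual resize (interpolation) would then happen on the padded result without distortion:

```python
import numpy as np

def pad_to_aspect(image, target_h, target_w, fill_value=0.0):
    """Pad an (h, w, c) image symmetrically with `fill_value` so its
    aspect ratio matches target_h:target_w, preserving all content."""
    h, w = image.shape[:2]
    if w * target_h < h * target_w:
        # Image is too narrow for the target aspect: pad the width.
        new_w = int(round(h * target_w / target_h))
        pad = new_w - w
        widths = ((0, 0), (pad // 2, pad - pad // 2), (0, 0))
    else:
        # Image is too short: pad the height.
        new_h = int(round(w * target_h / target_w))
        pad = new_h - h
        widths = ((pad // 2, pad - pad // 2), (0, 0), (0, 0))
    return np.pad(image, widths, constant_values=fill_value)

# A 100x50 portrait image padded for a square 64x64 target
# becomes 100x100; resizing it to 64x64 then keeps the aspect ratio.
padded = pad_to_aspect(np.zeros((100, 50, 3)), 64, 64)
```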

@ekaakurniawan

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?


fchollet commented Aug 5, 2019

A comment about resizing: when you specify resize, the image will be distorted if the aspect ratios of the original and the resized image do not match. This is something you might not want. I think it would be a nice feature to fill the image with a border to avoid distortion and preserve the aspect ratio. It should be possible to specify the fill mode, e.g. via a fill_mode argument.

I am not sure if this is the right place to propose this as it might be out of scope of this RFC.

The resize argument is mostly meant to cover the cases where you don't care very much about these details.

If you do care about aspect ratio, or if you want to use a different kind of resizing strategy (e.g. center crop and resize, which is popular), I would suggest implementing it in preprocessing_function, which should be applied on single images before any resizing.
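As an illustration, the kind of custom logic one could pass as `preprocessing_function` might look like the following center-crop step (the helper name is hypothetical, and the generator would apply it per image before its own resizing):

```python
import numpy as np

def center_crop(image):
    """Crop the largest centered square from an (h, w, c) image;
    a sketch of custom logic suitable for `preprocessing_function`."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]

img = np.zeros((100, 60, 3))
cropped = center_crop(img)
```

The popular "center crop and resize" strategy is then just this function followed by the generator's built-in resize.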


fchollet commented Aug 5, 2019

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?

We want the image transformations to support both numpy arrays and tensors, so we would probably have both tf.image and scipy-based implementations.
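Such a dual-backend function could be structured by dispatching on the input type. This is a sketch under assumptions, not the planned implementation; note that on the tensor path tf.image.flip_left_right expects a 3-D or 4-D tensor:

```python
import numpy as np

def flip_horizontal(image):
    """Flip an image left-right, accepting either a NumPy array or a
    TensorFlow tensor. TensorFlow is only imported on the tensor path,
    so the NumPy path works without TF installed."""
    if isinstance(image, np.ndarray):
        return image[:, ::-1]
    import tensorflow as tf  # assumed tf.Tensor input (3-D or 4-D)
    return tf.image.flip_left_right(image)
```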


AakashKumarNain commented Aug 5, 2019

How about using tf.image instead of PIL for image processing (type conversion, resizing and normalization)?

We want the image transformations to support both numpy arrays and tensors, so we would probably have both tf.image and scipy-based implementations.

The only problem is that tf.image doesn't yet cover major functionality provided by OpenCV and PIL. The implementations also differ, and while all users use OpenCV or PIL, not all of them use tf.image.

#### Question: how to support image segmentation in a simple way?

**Requirements:**
- Image loading and image augmentation should be synced across inputs and targets


Keeping things synced is simple. Once the user passes the inputs and targets, we can store the input filename and target filename in a pandas dataframe.

         inputfile            targetfile
0      path_to_file1     path_to_target1
1      path_to_file2     path_to_target2
...       ...                      ...

  1. To shuffle the data while keeping inputs and targets synced, we only have to shuffle the indices of this dataframe.
  2. Any augmentation applied to a particular input will be applied to the target as well, because they can now be processed sequentially by the same thread.
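The index-shuffling idea can be shown even without pandas: draw one permutation and apply it to both columns, so input/target pairs stay aligned. The file names below are made up for illustration:

```python
import numpy as np

input_files = ["img_0.png", "img_1.png", "img_2.png"]
target_files = ["mask_0.png", "mask_1.png", "mask_2.png"]

# One shared permutation keeps the i-th input paired with the
# i-th target, exactly like shuffling the dataframe's index.
rng = np.random.default_rng(42)
order = rng.permutation(len(input_files))
inputs = [input_files[i] for i in order]
targets = [target_files[i] for i in order]
```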


It's almost what flow_from_dataframe already does. So maybe we can beef up this function to perform whatever we need.


**Requirements:**
- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets


Wouldn't preprocessing_fn be enough for that?

vertical_flip=False,
rescale=None,
preprocessing_function=None,
postprocessing_function=None,
@PhilipMay PhilipMay Aug 5, 2019

There is some confusion between preprocessing_function and postprocessing_function here. The current implementation does not contain postprocessing_function.

At line 194 below (in the # PROPOSED section) it is missing.

Contributor Author

Fixed it.


fchollet commented Aug 14, 2019

We will do the design review for this RFC on Monday 19th August at 2pm PT. I am leaving the comment period open until then.

@fchollet
Contributor Author

Working list of unresolved questions:

  • Is the proposed workflow for segmentation acceptable?
  • What can we do about object detection workflows?
  • Should support for 3D image data be added?
  • What new ops should be added to tf.image to support this project?


PhilipMay commented Aug 14, 2019

Working list of unresolved questions:
* Is the proposed workflow for segmentation acceptable?
* What can we do about object detection workflows?
* Should support for 3D image data be added?
* What new ops should be added to tf.image to support this project?

From my point of view the RFC still has a minor bug. It says that postprocessing_function is implemented (in CURRENT), but it should be placed in the PROPOSED signature.

See above: #6 (review)

@PhilipMay

One more comment:
When you implement two parameters named preprocessing_function and postprocessing_function, the postprocessing_function is new, but the semantics of preprocessing_function will change.

Let me explain:

Currently, preprocessing_function is basically a postprocessing function: it happens after the rest of the augmentation has been done.

When you implement a postprocessing_function in the future, it will get the semantics of the old preprocessing_function. This could lead to confusion and bugs.

I suggest renaming both parameters, so that the old parameter is no longer available.

This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.

Some features are also being subplanted by [preprocessing layers](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md), in particular text processing.

nit: supplanted

- Make all classes in `keras.preprocessing` JSON-serializable.
- Deprecate `Tokenizer` class in favor of `TextVectorization` preprocessing layer
- Make image-transformation functions in `affine_transformations` submodule compatible with TensorFlow tensor inputs (accepting tensors and returning tensors).
- Improve signature of the above-mentioned functions, by using fully-spelled argument names, and fewer arguments if possible.

How are we approaching backwards compatibility here? Presumably this pertains to keras-team/keras initially, but we will pull into TF, at which point we will need to maintain all the old APIs. What's the plan?


In Keras we have a deprecation procedure in our wiki.
We also have utils in Keras to emit the correct warnings.
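Such a renaming-with-warning utility can be sketched as a decorator. This is an illustrative stand-in, not Keras's actual deprecation helper; the argument names (`wrap_mode`/`fill_mode`) are made up:

```python
import functools
import warnings

def deprecated_arg(old, new):
    """Decorator that keeps an old keyword argument working under its
    new name while emitting a DeprecationWarning."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if old in kwargs:
                warnings.warn(
                    f"`{old}` is deprecated, use `{new}` instead.",
                    DeprecationWarning, stacklevel=2)
                kwargs[new] = kwargs.pop(old)
            return fn(*args, **kwargs)
        return inner
    return wrap

@deprecated_arg(old="wrap_mode", new="fill_mode")
def transform(image, fill_mode="nearest"):
    # Toy body: just report which fill mode was received.
    return fill_mode

# Calling with the old name still works, with a DeprecationWarning.
result = transform(None, wrap_mode="reflect")
```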

```python
# PROPOSED

def array_to_generator(
```

There seems to be some variation in from/to -- should these be standardized?

Contributor Author

@fchollet fchollet Aug 19, 2019

Yes, let's standardize them. We can settle on either (input)_to_(output) or (output)_from_(input).

seed=None,
subset=None):

def dataset_from_dataframe(

Is there a world where we don't need all of these, and we can just determine data type from the passed object? String is dir path, array/dataframe can be type-checked. Then we just have to_dataset(), without the need for so many separate APIs.

Contributor Author

That would be a simplification in terms of the set of methods available, but the signatures would get quite complex / confusing since each action has its own specific keyword arguments on top of the shared ones (e.g. follow_links, target_mode...)

preprocessing_function=None,
interpolation_order=1,
data_format='channels_last',
dtype='float32'):

This is a lot of constructor args, and it biases us towards adding more with each new feature, rather than finding other ways of setting parameters. Is this the best way to do this, versus having subclasses that handle subsets, or deferring collecting some of these until the method in which they are needed?


I guess we could decompose this into two parts: the normalization and the augmentation; the rest is shared. But it may add some complexity to what the user needs to do.

Contributor Author

There is one context in which having lots of arguments is acceptable: when they form a flat, list-like configuration, i.e. adding more arguments results in linear complexity and maintenance workload, because they don't interact with each other. This is the case here: we're just describing a list of image transformations. Once you have the logic infrastructure to do one image transformation, doing arbitrarily many does not add much.


`tf.data.Dataset` is the main API for data loading and preprocessing in TensorFlow. It has two advantages:

- It supports GPU prefetching

"It supports GPU prefetching" is a small subset of the following advantages:

  • it supports parallelization of transformations
  • it supports parallelization of I/O
  • it supports software pipelining (e.g. prefetching to GPU / TPU)
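The software-pipelining point can be illustrated outside tf.data with a pure-Python prefetcher: a background thread fills a bounded buffer so data preparation overlaps with consumption, analogous in spirit to Dataset.prefetch (a sketch only, not how tf.data is implemented):

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread keeps a
    bounded queue filled, overlapping production with consumption."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item
```

While the consumer (e.g. a training step) processes item i, the producer is already preparing item i+1.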


I think the first two advantages you mentioned are not listed because they are already supported by the current API.

- Remove the submodules `image`, `text`, `timeseries` from the public API and expose their contents as part of the top-level `keras.preprocessing`.
- Make all classes in `keras.preprocessing` JSON-serializable.
- Deprecate `Tokenizer` class in favor of `TextVectorization` preprocessing layer
- Make image-transformation functions in `affine_transformations` submodule compatible with TensorFlow tensor inputs (accepting tensors and returning tensors).


I assume this would necessitate new TF ops -- would we expose those in tf.image? I'd like to make that API more complete.

Contributor

There are also some ops from Deepmind at https://github.com/deepmind/multidim-image-augmentation

- Rename methods `flow` to `generator_from_arrays`, `flow_from_directory` to `generator_from_directory`, and `flow_from_dataframe` to `generator_from_dataframe`.
- Improve signature of the above-mentioned methods.
- Figure out how to support image segmentation use cases with `ImageDataGenerator` (open question).
- Refactor `TimeseriesGenerator` to follow a similar design as `ImageDataGenerator`. Only configuration arguments should be passed in the constructor. The data should be passed to methods such as `dataset_from_arrays`.


Very broad question: Is it worth having this two level structure (IIUC): A Generator class configured with a few options, which then offers methods to create generators or Datasets? Would it be easier to have a flat structure of factory methods (or classes) which simply take configuration and data and produce an appropriate Dataset / generator / array?

In other words, what does the separation of data and config buy us? Do people reuse a (say) ImageDataGenerator several times to produce different data using the same configured preprocessing?

@Dref360 Dref360 Aug 15, 2019

I don't think people reuse their ImageDataGenerator, but both the ImageDataGenerator constructor and dataset_from_dataframe require many arguments, which might confuse the user if combined into a single function.

Contributor Author

The factoring between config and data is there for 2 reasons:

  • to enable reusing a config on new data; this is how we plan to support image segmentation, for instance (target masks have to be augmented in the same way as inputs)
  • to make it easier for people to think about data augmentation/normalization and loading separately, in two steps, which would otherwise form a very large and potentially confusing single step.
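The segmentation pattern enabled by this config/data split amounts to sampling transform parameters once and applying the identical transform to both the image and its mask. A NumPy sketch, with made-up helper names:

```python
import numpy as np

def sample_transform(rng):
    """Draw augmentation parameters once, so the identical transform
    can be applied to an input image and its segmentation mask."""
    return {"flip": rng.random() < 0.5, "k_rot": int(rng.integers(4))}

def apply_transform(image, params):
    """Apply the sampled flip/rotation to any array of matching shape."""
    out = image[:, ::-1] if params["flip"] else image
    return np.rot90(out, k=params["k_rot"])

rng = np.random.default_rng(0)
params = sample_transform(rng)

image = np.arange(16).reshape(4, 4)
mask = (image > 7).astype(int)

# Same params => pixels of image and mask stay aligned.
aug_image = apply_transform(image, params)
aug_mask = apply_transform(mask, params)
```

Because both arrays go through the same pixel permutation, thresholding the augmented image reproduces the augmented mask exactly.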

@fchollet
Contributor Author

Following the design review meeting on Monday, we have decided to take more time to significantly rewrite this proposal. I'll be closing the PR for the time being, and we will open a new PR with the revised proposal at a later date.
