Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Refactor image module #107

Merged
merged 3 commits into from
Dec 26, 2018
Merged

Refactor image module #107

merged 3 commits into from
Dec 26, 2018

Conversation

rragundez
Copy link
Contributor

@rragundez rragundez commented Dec 18, 2018

Summary

I planned to update the image pre-processing part of Keras during my Christmas break to have a more complete cover of use-cases: multi-label, multi-outputs, dictionary iterator, perhaps others. I have faced some of these use-cases myself and have encounter people with the same use-cases, so I think is a good idea to include them in the image pre-processing module, while at the same time be careful of not addressing every exceptional use-case and ending up with untraceable code.

As I started I found that there is a single file containing all the logic with 2267 lines and all kind of different logic in there. I do not think this is good practice, and in the future (I think already now) maintaining and improving this file will be a pain (also the tests).

So to start I propose to break the file into a module, with a much better structure for iterators and the ImageDataGenerator containing the main logic. The classes and functions inside the image module can still be directly imported from keras_preprocessing.image as they are exposed via the __init__.py file. Therefore no change is required from imports depending on keras_preprocessing.image.

What do you think of this approach? should I continue?

Related Issues

There are no specific issues to this particular PR, as is just a refactoring, but you can already see how the quality of the code has deteriorated compared to the native Keras code, I argue that a critical part to fix bugs and contribute is to have a clean tractable structure, specially for new contributors to the project like myself. This of course will also accelerate the addition of new features on this repo to the core Keras repo.

For example after a quick look into the DataframeIterator I found many issues with it:

  • Unnecessary check for pandas as it's not used anywhere
  • Unnecessary check for string types of x_col, not Pythonic also problem because then what do you check and you don't check? e.g. y_col. If you want to use this you should use typing
  • Keeps copy of the full dataframe since start even though latter it's reduced.
  • Keeps x_col which is unnecessary, the iterator should only keep, classes and filenames
  • What's the use-case of other? if it's in order to have continuous targets, then that mode should be specified. other doesn't really add much as such to the different use cases as they are very different, but it certainly adds logic complexity. I mean just to start, the targets are saved in self.data in comparison with self.classes with other modes. other is not necessarily the problem is just that the iterators logic were initially made to handle only multi-class use-cases, and well the core logic needs to be updated to handle the other use-cases as well while at the same time keeping consistency.
  • classes is numpy array while in DirectoryGenerator is a list
  • Value error exception for classes which is not in DirectoryIterator. Classes can be set in any scenario, no idea why that couldn't be using input for example. I understand there are specific cases where indeed classes is not necessary, but the current logic does nothing to handle that.
  • Unnecessary assigning of self.classes = classes at top of init. Classes should be unique.
  • remove unnecessary check for directory None and has_ext. Directory has to be given. I think has_ext should be removed, this is the typical example where the logic tries to handle very specific cases causing a mess in the code flow and logic. The user should be responsible for giving filenames equal to filenames in the directory.
  • Unnecessary ValueError is understood that filenames contain extensions.
  • For all iterators there is confusion between classes and labels, which confuses the user. Classes are unique and labels are the class for each observation.

So like i said there is no specific Issue for this change, but the below issues could be resolved faster and more accessible to new contributors with the structure change I propose.

#99 #102 #95 #88 #67 #56

P.D. I think you should start thinking on doing the same with sequence.py and text.py as they start growing.

PR Overview

  • [n ] This PR requires new unit tests [y/n] (make sure tests are included)
  • [n] This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • [y] This PR is backwards compatible [y/n]
  • [n] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@rragundez rragundez changed the title Refactor image module [WIP] Refactor image module Dec 18, 2018
@rragundez rragundez changed the title [WIP] Refactor image module Refactor image module Dec 18, 2018
@Dref360
Copy link
Contributor

Dref360 commented Dec 19, 2018

Hello,
Thanks for the PR, I'll have time to review during the holidays.
Quick Note:

  • When making Interface, be sure that they do not rely on object member. Like BatchFromFilesMixin where it requires self.filenames. Use properties and/or make it an abstract class with abstract methods.
  • We should probably split utils.py in two files: one with all affine transformations and the other with the loading, saving, listing images.

@rragundez
Copy link
Contributor Author

Thanks for the feedback @Dref360 , you are right about both points. I'll rollback the Mixin commit and make this PR just about the splitting of the image module + moving tests accordingly, then I'll open another PR with the Mixin code deduplication stuff, would be easier for you to review as well. Thanks again.

@rragundez
Copy link
Contributor Author

@Dref360 I've rebased. Can you let me know what else is needed or you think we should change? Is quite a pain to rebase on every change because of the file split. So would be nice to make this change before adding more code to the image module. Just let me know, thanks!

@Dref360
Copy link
Contributor

Dref360 commented Dec 26, 2018

LGTM

@Dref360 Dref360 merged commit e25c5a1 into keras-team:master Dec 26, 2018
@rragundez rragundez deleted the refactor-image branch January 5, 2019 15:30
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants