Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve inconsistency of Dataset.map interface for load_from_cache_file #5514

Closed
HallerPatrick opened this issue Feb 8, 2023 · 4 comments · Fixed by #5515
Closed

Improve inconsistency of Dataset.map interface for load_from_cache_file #5514

HallerPatrick opened this issue Feb 8, 2023 · 4 comments · Fixed by #5515
Labels
enhancement New feature or request

Comments

@HallerPatrick
Copy link
Contributor

Feature request

  1. Replace the load_from_cache_file default value to True.
  2. Remove or alter checks from is_caching_enabled logic.

Motivation

I stumbled over an inconsistency in the Dataset.map interface. The documentation (and source) states for the parameter load_from_cache_file:

load_from_cache_file (`bool`, defaults to `True` if caching is enabled):
    If a cache file storing the current computation from `function`
    can be identified, use it instead of recomputing.
  1. load_from_cache_file default value is None, while being annotated as bool
  2. It is inconsistent with other method signatures like filter, that have the default value True
  3. The logic is inconsistent, as the map method checks if caching is enabled through is_caching_enabled. This logic is not used for other similar methods.

Your contribution

I am not fully aware of the logic behind caching checks. If this is just a inconsistency that historically grew, I would suggest to remove the is_caching_enabled logic as the "default" logic. Maybe someone can give insights, if environment variables have a higher priority than local variables or vice versa.

If this is clarified, I could adjust the source according to the "Feature request" section of this issue.

@HallerPatrick HallerPatrick added the enhancement New feature or request label Feb 8, 2023
@mariosasko
Copy link
Collaborator

Hi, thanks for noticing this! We can't just remove the cache control as this allows us to control where the arrow files generated by the ops are written (cached on disk if enabled or a temporary directory if disabled). The right way to address this inconsistency would be by having load_from_cache_file=None by default everywhere.

@HallerPatrick
Copy link
Contributor Author

Hi! Yes, this seems more plausible. I can implement that. One last thing is the type annotation load_from_cache_file: bool = None. Which I then would change to load_from_cache_file: Optional[bool] = None.

@HallerPatrick
Copy link
Contributor Author

PR #5515

@mariosasko
Copy link
Collaborator

Yes, Optional[bool] is the correct type annotation and thanks for the PR.

@mariosasko mariosasko linked a pull request Feb 9, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants