fix(datasets): Take mode argument into account when saving dataset #805
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Is there an append+binary mode in Python? The hardcoded
If this works, we don't need a new dataset in our docs here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators
"ab" seems to work
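A minimal sketch of the "ab" check, using only the standard library instead of pandas (an assumption made to keep it self-contained). Bytes are collected in an in-memory buffer and then written to a file opened in "wb" or "ab". The helper names `write_via_buffer` and `read_bytes` and the temporary file name are hypothetical.

```python
import os
import tempfile
from io import BytesIO


def write_via_buffer(path: str, payload: bytes, mode: str) -> None:
    """Write payload to path through an in-memory buffer (hypothetical helper)."""
    buffer = BytesIO()
    buffer.write(payload)
    # The mode of the outer open() decides whether the file is
    # truncated ("wb") or appended to ("ab").
    with open(path, mode=mode) as f:
        f.write(buffer.getvalue())


def read_bytes(path: str) -> bytes:
    """Read the whole file back as bytes."""
    with open(path, "rb") as f:
        return f.read()


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "tmp_ab.csv")

write_via_buffer(path, b"a,b\n1,2\n", mode="wb")  # creates or truncates the file
write_via_buffer(path, b"3,4\n", mode="ab")       # appends to the existing bytes
appended = read_bytes(path)

write_via_buffer(path, b"5,6\n", mode="wb")       # truncates again
overwritten = read_bytes(path)
```

In this sketch the append only takes effect because the file itself is opened with "ab"; the buffer has no notion of a file mode.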
Thank you, @merelcht, the approach looks good to me.
I would go with the docstring update, so all allowed mode options are reflected in the documentation. From our user research interview, that's the first place users look. I don't think we should replace w with wb and hide working with binary file objects.
…dro-org/kedro-plugins into fix/handle-mode-arg-correctly
…tead Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Few comments:
See this snippet below, because we are writing to buffer (not sure what's the benefit). The

    with open("tmp_wb", mode="wb") as f:
        buffer = BytesIO()
        df.to_csv(buffer, index=False, mode="w")  # "w" can be anything here and doesn't get used, based on some testing.
        f.write(buffer.getvalue())

So the problem is, this creates some bad UX because very few people know how to use fsspec-specific arguments. To use

    my_dataset:
      type: pandas.CSVDataset
      load_args:
      fs_args:
        open_load_args:
          mode: wb

We usually have parity between kedro's datasets versus the underlying mapping, in this case the

    my_dataset:
      type: pandas.CSVDataset
      load_args:
        mode: wb  # Maybe we can manually map this into fsspec args, it solves the append problem but not 'w'.
      # fs_args:
      #   open_load_args:
      #     mode: wb
@noklam I totally agree with you that it's bad UX! I'm not even that sure what the purpose is for users of
The problem I ran into is that the same mode can't be used as
We do actually have docs about this: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding
There are some use cases, but you have to dig into the fsspec docs themselves. For example, using
See: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.find
Upon further investigation and trying another solution, I've actually come to the conclusion that in order to make this the most consistent across various datasets, the best solution is to move the hard-coded mode argument to While in I still think @noklam has a great point in UX not being very good here, because this solution requires a catalog entry like:
To be fair, this is "expected" and explained in our docs: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding
Where possible I will update datasets to not use byte conversion, so at least it's not necessary to pass "wb"/"ab" etc. to mode.
I'm glad you dug this up @noklam because it's exactly the same I'm proposing for Polars... #625 but looks like the bottleneck is our versioning. It's not a discussion for this PR but
I think it's a good idea to move those defaults up and make them explicit. As I understand it, this is mostly a refactoring change; the defaults are not changed for any datasets.
This is good enough to address the original issue. The follow-up question I have is: is our doc clear enough about this? fsspec is only used for versioned datasets, and it is an implementation detail. Most users wouldn't know immediately, but at least it is now a bit more visible in the docs.
I 100% agree with the point about the UX. Should we add some clarifications and examples in the docs until we address the UX problem? For users, it can be not very clear what happens if:
Also, it might not be clear what the expected way to append (or apply some other setting than default) is - set just
Edit: we can add some examples/clarifications here: https://docs.kedro.org/en/stable/data/data_catalog.html#load-and-save-arguments or here: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding
@ElenaKhaustova and @noklam, thanks for the reviews! I agree that some doc updates are needed as well. I will create a PR on the
…edro-org#805)
* Take mode argument into account when saving dataset
* Move mode to default save args
* Revert changes to add mode to save args and add as fs arg default instead
* Separate fs save and load args again
* Add tests for coverage
* Use fs_args to pass mode for all pandas based datasets
* Make other datasets use fs_args for handling mode as well
* Refactor and make all datasets consistent
* Update release notes
---------
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Description
Closes #336 and #513
Development notes
First attempt:
Moved the hard-coded mode argument to the DEFAULT_SAVE_ARGS on top. A caveat here is that the mode must be a "bytes" mode, so it should be wb, ab etc., instead of just w, a ...
I could either add some code to check if the mode provided ends in b and add it otherwise, or I can update the docstring to mention this.
This didn't work in the end, because other datasets use save args and fs args separately.
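The "check and add b" option from the first attempt could look roughly like the sketch below. This is only an illustration and not code from the PR; the helper name `ensure_binary_mode` is hypothetical, and it checks for "b" anywhere in the mode string (rather than only at the end) so that modes such as "r+b" are left alone.

```python
def ensure_binary_mode(mode: str) -> str:
    """Return mode with the binary flag added if missing (hypothetical helper)."""
    if "t" in mode:
        # An explicit text mode cannot be combined with "b".
        raise ValueError(f"Text mode {mode!r} cannot be used for a bytes write")
    # Leave modes that are already binary untouched, otherwise append "b".
    return mode if "b" in mode else mode + "b"


modes = {m: ensure_binary_mode(m) for m in ("w", "a", "wb", "ab", "w+")}
```

Silently rewriting the user's mode hides the fact that bytes are being written, which is the concern raised in the review about replacing w with wb.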
Second attempt:
Moved the hard-coded mode argument to the DEFAULT_FS_ARGS on top. Using fs_args to propagate the defaults and read user provided values. This should work for more datasets, I've tried with pandas.CSVDataset and pandas.JSONDataset but want to wait with changing this in more instances until we have consensus on the approach.
It feels like I'm reverting part of https://github.com/McK-Private/private-kedro/pull/1118 so would like to hear what others think.
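The idea of propagating defaults through fs_args while still reading user-provided values can be sketched as follows. This is an assumption-laden illustration, not the code merged in the PR: the nested key `open_args_save`, the default value "wb", and the helper `resolve_open_args_save` are all chosen for this example.

```python
from copy import deepcopy
from typing import Any, Optional

# Defaults declared "on top", in the spirit of DEFAULT_FS_ARGS from the notes.
# The nested key "open_args_save" is an assumption made for this sketch.
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "wb"}}


def resolve_open_args_save(fs_args: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Merge user-provided save options over the defaults (hypothetical helper)."""
    # Copy first so the caller's dictionary is never mutated.
    user_fs_args = deepcopy(fs_args) if fs_args else {}
    user_open_args = user_fs_args.get("open_args_save") or {}
    # User values win; defaults fill in anything the user left out.
    return {**DEFAULT_FS_ARGS["open_args_save"], **user_open_args}


default_args = resolve_open_args_save()
append_args = resolve_open_args_save({"open_args_save": {"mode": "ab"}})
```

With this shape, a catalog entry only needs to set the mode under fs_args when it wants something other than the default, for example "ab" to append.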
Checklist
RELEASE.md file