fix(datasets): Take mode argument into account when saving dataset #805

merelcht · 2024-08-07T14:50:26Z

Description

Closes #336 and #513

Development notes

First attempt:
Moved the hard-coded mode argument to DEFAULT_SAVE_ARGS at the top. A caveat here is that the mode must be a "bytes" mode, so it should be wb, ab, etc., instead of just w or a.

I could either add some code to check if the mode provided ends in b and add it otherwise, or I can update the docstring to mention this.
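
For illustration, the first option (checking the mode and adding the b flag when it is missing) could look roughly like the sketch below. `ensure_binary_mode` is a hypothetical helper, not something that exists in kedro-datasets, and the handling of `t` and `+` is an assumption about what such a check would need to cover.

```python
def ensure_binary_mode(mode: str) -> str:
    """Return `mode` with a 'b' flag added if it is not already a binary mode."""
    if "b" in mode:
        # Already binary ("wb", "ab", "w+b", ...): leave untouched.
        return mode
    if "t" in mode:
        # Explicit text mode ("wt", "at") conflicts with binary, so swap the flag.
        return mode.replace("t", "b")
    if mode.endswith("+"):
        # Keep the update flag last, e.g. "w+" -> "wb+".
        return mode[:-1] + "b+"
    return mode + "b"
```

With this, a user-supplied `a` would be opened as `ab`, while `wb` would pass through unchanged. The docstring-only alternative avoids this kind of silent rewriting.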

This didn't work in the end, because other datasets use save args and fs args separately.

Second attempt:
Moved the hard-coded mode argument to DEFAULT_FS_ARGS at the top, and used fs_args to propagate the defaults and read user-provided values. This should work for more datasets. I've tried it with pandas.CSVDataset and pandas.JSONDataset, but want to wait with changing more datasets until we have consensus on the approach.

It feels like I'm reverting part of https://github.com/McK-Private/private-kedro/pull/1118 so would like to hear what others think.

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

noklam · 2024-08-07T14:59:29Z

Is there an append+binary mode in Python? The hardcoded wb is definitely a problem but I think the real problem is we use buffer but this does not seem to support append.

If this works we don't need a new dataset in our docs here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators

merelcht · 2024-08-07T15:01:15Z

> Is there an append+binary mode in Python? The hardcoded wb is definitely a problem but I think the real problem is we use buffer but this does not seem to support append.
>
> If this works we don't need a new dataset in our docs here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators

"ab" seems to work

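As a quick check of the append+binary question, the sketch below uses only the standard library: it writes two chunks with separate `open()` calls, once with `ab` and once with `wb`. The file name and the `write_chunks` helper are made up for the example and are not part of Kedro.

```python
import os
import tempfile


def write_chunks(path: str, chunks: list[bytes], mode: str) -> bytes:
    """Write each chunk with its own open() call and return the final file contents."""
    for chunk in chunks:
        with open(path, mode=mode) as f:
            f.write(chunk)
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.csv")
    # "ab" creates the file if needed and keeps earlier chunks.
    appended = write_chunks(target, [b"a,b\n", b"1,2\n"], mode="ab")
    # "wb" truncates on every open, so only the last chunk survives.
    overwritten = write_chunks(target, [b"a,b\n", b"1,2\n"], mode="wb")

# appended == b"a,b\n1,2\n"
# overwritten == b"1,2\n"
```

This is the behaviour that matters for the generator use case: each yielded chunk triggers a separate save, so a hard-coded `wb` discards everything but the last chunk.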

ElenaKhaustova

Thank you, @merelcht, the approach looks good to me.

I would go with the docstring update, so all allowed mode options are reflected in the documentation. From our user research interview, that's the first place users look. I don't think we should replace w with wb and hide working with binary file objects.

noklam · 2024-08-16T10:17:20Z

Few comments:

https://github.com/McK-Private/private-kedro/pull/1118 tries to delegate to pandas as much as possible, which I think is a good idea.
Does w mode work at all? My impression is that since we already use BytesIO, it's no longer possible to use w mode.

See the snippet below. Because we write to a buffer (I'm not sure what the benefit is), the mode argument passed to pandas is not used at all.

from io import BytesIO
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
with open("tmp_wb", mode="wb") as f:
    buffer = BytesIO()
    df.to_csv(buffer, index=False, mode="w")  # "w" can be anything here; based on some testing it isn't used.
    f.write(buffer.getvalue())

So the problem is that this creates bad UX, because very few people know how to use fsspec-specific arguments. To use pandas.CSVDataset with a different mode, they need to define the dataset as follows:

my_dataset:
  type: pandas.CSVDataset
  load_args:
    fs_args:
      open_load_args:
        mode: wb

We usually have parity between Kedro's datasets and the underlying library calls, i.e. csv_data.load(load_args = load_args) should be equivalent to pd.read_csv(load_args). In this case the mode argument does not do anything.
This is super awkward to write and we probably don't have docs about this. I'd much prefer it if we could have this instead:

my_dataset:
  type: pandas.CSVDataset
  
  load_args:
    mode: wb # Maybe we can manually map this into fsspec args, it solves the append problem but not 'w'.
#    fs_args:
#      open_load_args:
#        mode: wb
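
The manual mapping suggested in the comment above could be sketched as follows. This is only an illustration of the idea: `route_mode_to_fs_args` is a hypothetical helper, and the choices that an explicit filesystem-level mode wins and that a missing b flag is appended are assumptions, not behaviour of kedro-datasets.

```python
from typing import Any


def route_mode_to_fs_args(
    dataset_args: dict[str, Any], fs_open_args: dict[str, Any]
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Move a user-supplied `mode` out of the dataset args into the fsspec open args."""
    # Copy both dicts so the caller's configuration is not mutated.
    dataset_args = dict(dataset_args)
    fs_open_args = dict(fs_open_args)
    mode = dataset_args.pop("mode", None)
    if mode is not None and "mode" not in fs_open_args:
        # The data is written through a bytes buffer, so the file needs a binary mode.
        fs_open_args["mode"] = mode if "b" in mode else mode + "b"
    return dataset_args, fs_open_args


remaining, open_args = route_mode_to_fs_args({"mode": "a", "index": False}, {})
# remaining == {"index": False}
# open_args == {"mode": "ab"}
```

As noted in the comment, this would cover appending, but it does not make a plain text `w` meaningful while the data still goes through a bytes buffer.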

merelcht · 2024-08-16T10:45:07Z

@noklam I totally agree with you that it's bad UX! I'm not even sure what purpose fs_args serves for users. Inside the dataset implementation before my changes, you can see that save_args are sent to the dataset-specific API, but for the fs part the mode is hard-coded.

data.to_json(path_or_buf=buf, **self._save_args)

with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(buf.getvalue())

The problem I ran into is that the same mode can't be used as save_args and as fs_args and there's another issue when storage_options are initialised. I have tried to find the reason for the bytes conversion, but all I can find is that it was introduced in https://github.com/McK-Private/private-kedro/pull/1118. Maybe the true fix is getting rid of all the byte conversion stuff so save/load args can be passed on to the fs operations as well.

merelcht · 2024-08-16T10:46:25Z

We do actually have docs about this: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding

noklam · 2024-08-16T10:58:12Z

> I totally agree with you that it's bad UX! I'm not even sure what purpose fs_args serves for users. Inside the dataset implementation before my changes, you can see that save_args are sent to the dataset-specific API, but for the fs part the mode is hard-coded.

There are some use cases, but you have to dig in fsspec docs itself. For example, using fs_args for PartitionedDataset can allow users to pass in a regex to search for a specific pattern.

See: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.find
and https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/api/kedro_datasets.partitions.PartitionedDataset.html#:~:text=load_args%20(Optional%5Bdict%5Bstr%2C%20Any%5D%5D)%20%E2%80%93%20Keyword%20arguments%20to%20be%20passed%20into%20find()%20method%20of%20the%20filesystem%20implementation.

merelcht · 2024-08-20T12:43:17Z

Upon further investigation and trying another solution, I've actually come to the conclusion that in order to make this the most consistent across various datasets, the best solution is to move the hard-coded mode argument to DEFAULT_FS_ARGS on top and use fs_args to propagate the defaults and read user provided values.
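
A minimal sketch of that approach, outside of any real dataset class, is shown below. `resolve_open_args` is a hypothetical helper, and the key names `open_args_load` / `open_args_save` as well as the `wb` default are assumptions made for the example; the actual defaults differ per dataset.

```python
from copy import deepcopy
from typing import Any, Optional

# Assumed default: the mode that used to be hard-coded in the save method.
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "wb"}}


def resolve_open_args(
    fs_args: Optional[dict[str, Any]] = None,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Merge user-provided fs_args over the defaults; return (load, save) open args."""
    user_args = deepcopy(fs_args) if fs_args else {}
    load_args = {
        **DEFAULT_FS_ARGS.get("open_args_load", {}),
        **user_args.pop("open_args_load", {}),
    }
    save_args = {
        **DEFAULT_FS_ARGS.get("open_args_save", {}),
        **user_args.pop("open_args_save", {}),
    }
    return load_args, save_args


default_load, default_save = resolve_open_args()
_, append_save = resolve_open_args({"open_args_save": {"mode": "ab"}})
# default_save == {"mode": "wb"}
# append_save == {"mode": "ab"}
```

The resulting save arguments would then be passed to the filesystem `open` call instead of a literal `mode="wb"`, so the default is visible in one place and a user-provided mode overrides it.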

While in pandas.CSVDataset and pandas.JSONDataset it seemed that fs_args and save_args can be used interchangeably, this is not the case for all datasets. For example, in pandas.FeatherDataset, mode is not an accepted argument for the save_args passed to dataframe.to_feather(...) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html), so any mode argument has to come through fs_args.

I still think @noklam has a great point in UX not being very good here, because this solution requires a catalog entry like:

my_dataset:
  type: pandas.CSVDataset
  fs_args:
    open_args_save:
      mode: wb

To be fair, this is "expected" and explained in our docs: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding
But for the time being I suggest we tackle that UX separately.

Where possible I will update datasets to not use byte conversion, so that at least it's not necessary to pass "wb", "ab", etc. as the mode.

astrojuanlu · 2024-08-21T16:52:59Z

> https://github.com/McK-Private/private-kedro/pull/1118 tries to delegate to pandas as much as possible, which I think is a good idea.

I'm glad you dug this up @noklam, because it's exactly what I'm proposing for Polars in #625, but it looks like the bottleneck is our versioning.

It's not a discussion for this PR but

At this point we should start by documenting what the "social contract" of Kedro datasets is.

noklam

I think it's a good idea to move those defaults up and make them explicit. As I understand it, this is mostly a refactoring change and the defaults are not changed for any dataset.

This is good enough to address the original issue. The follow-up question I have is: are our docs clear enough about this? fsspec is only used for versioned datasets, and it is an implementation detail. Most users wouldn't know this immediately, but at least it is now a bit more visible in the docs.

See: https://kedro--805.org.readthedocs.build/projects/kedro-datasets/en/805/api/kedro_datasets.pandas.CSVDataset.html

ElenaKhaustova · 2024-08-22T09:45:46Z

> While in pandas.CSVDataset and pandas.JSONDataset it seemed that fs_args and save_args can be used interchangeably, this is not the case for all datasets.

I 100% agree with the point about the UX. Should we add some clarifications and examples in the docs until we address the UX problem? For users, it may not be clear what happens if:

my_dataset:
  type: pandas.CSVDataset
  fs_args:
    open_args_save:
      mode: wb
  save_args:
    mode: a

Also, it might not be clear what the expected way to append (or apply some other setting than default) is - set just fs_args/save_args or both.

Edit: we can add some examples/clarifications here: https://docs.kedro.org/en/stable/data/data_catalog.html#load-and-save-arguments or here: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-data-from-a-local-binary-file-using-utf-8-encoding

merelcht · 2024-08-22T10:24:59Z

@ElenaKhaustova and @noklam, thanks for the reviews! I agree that some doc updates are needed as well. I will create a PR on the kedro repo to do this.

…edro-org#805)

* Take mode argument into account when saving dataset
* Move mode to default save args
* Revert changes to add mode to save args and add as fs arg default instead
* Separate fs save and load args again
* Add tests for coverage
* Use fs_args to pass mode for all pandas based datasets
* Make other datasets use fs_args for handling mode as well
* Refactor and make all datasets consistent
* Update release notes

---------

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Take mode argument into account when saving dataset

3fba245

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

merelcht self-assigned this Aug 7, 2024

merelcht requested review from noklam and lrcouto August 7, 2024 14:50

merelcht and others added 2 commits August 7, 2024 17:00

Fix test

ed881fe

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'main' into fix/handle-mode-arg-correctly

49017f6

merelcht requested a review from ElenaKhaustova August 13, 2024 09:21

ElenaKhaustova approved these changes Aug 13, 2024

merelcht and others added 8 commits August 13, 2024 15:41

Move mode to default save args

28d173f

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'fix/handle-mode-arg-correctly' of https://github.com/ke…

e1c59e8

…dro-org/kedro-plugins into fix/handle-mode-arg-correctly

Revert changes to add mode to save args and add as fs arg default ins…

e71b810

…tead Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'main' into fix/handle-mode-arg-correctly

eef2651

Fix lint

079b0a3

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Separate fs save and load args again

1026380

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Add tests for coverage

0725a96

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Try simplify init

c7ba314

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

merelcht added 2 commits August 20, 2024 11:55

Remove writing to buffer and use save_args both in saving and conversion

9edd0f9

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Fix tests

5905ff6

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

merelcht added 5 commits August 20, 2024 13:54

Revert back to using fs_args to pass mode argument

0de251d

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Use fs_args to pass mode for all pandas based datasets

3238997

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Make other datasets use fs_args for handling mode as well

ef11faa

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Refactor and make all datasets consistent

97e49a2

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Fix message dataset

46aef8d

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Fix tests

2aa49a8

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

merelcht marked this pull request as ready for review August 21, 2024 08:23

merelcht and others added 3 commits August 21, 2024 11:04

Merge branch 'main' into fix/handle-mode-arg-correctly

86fe86f

Clean up

9e7c363

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'fix/handle-mode-arg-correctly' of https://github.com/ke…

ea4556e

…dro-org/kedro-plugins into fix/handle-mode-arg-correctly

merelcht requested a review from ElenaKhaustova August 21, 2024 12:30

Merge branch 'main' into fix/handle-mode-arg-correctly

8f084de

noklam approved these changes Aug 22, 2024

ElenaKhaustova approved these changes Aug 22, 2024

merelcht and others added 3 commits August 22, 2024 13:59

Merge branch 'main' into fix/handle-mode-arg-correctly

e6825a7

Update release notes

0fb3dac

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'fix/handle-mode-arg-correctly' of https://github.com/ke…

806d8f4

…dro-org/kedro-plugins into fix/handle-mode-arg-correctly

merelcht enabled auto-merge (squash) August 22, 2024 13:06

merelcht merged commit c67fa9e into main Aug 22, 2024
14 checks passed

merelcht deleted the fix/handle-mode-arg-correctly branch August 22, 2024 13:16

This was referenced Aug 22, 2024

Add mode as argument to pandas.CSVDataSet and other datasets #513

Closed

Add extra clarification about fs_args kedro-org/kedro#4112

Merged
