
[KED-2254] Add datatable csv dataset (#592) #616

Closed

Conversation

@mlisovyi mlisovyi commented Nov 17, 2020

Description

Adds a dataset to read/write CSV using datatable instead of pandas.
Resolves #592
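
A minimal usage sketch of the new dataset; the class name CSVDataSet and the filepath argument are assumptions based on the file added in this PR (kedro/extras/datasets/datatable/csv_dataset.py) and kedro's usual dataset conventions, so the final signature may differ:

```python
# Hypothetical usage sketch; class name and arguments follow kedro's usual
# dataset conventions and may differ from the final implementation in this PR.
from kedro.extras.datasets.datatable import CSVDataSet

data_set = CSVDataSet(filepath="data/01_raw/iris.csv")  # path is illustrative
frame = data_set.load()   # returns a datatable.Frame instead of a pandas.DataFrame
data_set.save(frame)      # writes the Frame back out as CSV
```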

Development notes

I have added only CSV, and not Excel, as a dataset. The latter causes a lot of problems when operating with fsspec. And when I try to use the file name directly, I still fail the basic tests: for some reason a write -> read round trip on an Excel file does not yield the same column data types in the resulting datatable.Frame as in the original datatable.Frame.

To be discussed:
What is the purpose of this add-on? While implementing it, I realised that:

  • one gets a datatable.Frame object instead of a pandas.DataFrame:
    • for R users who are familiar with data.table in R, this is beneficial, as it provides a familiar API instead of the pandas API
    • but it works only with CSV as input
    • but it can be confusing that a CSVDataset with merely a different prefix returns an object with a very different API
  • one gets the performance gains of datatable over pandas:
    • multi-threaded reading of CSV
    • optimised string storage, i.e. a smaller memory footprint than pandas object columns
    • out-of-core computation (one can in principle process data that is larger than RAM)
    • but only for CSV input; maybe it would make more sense to write a wrapper around pandas.XXXDataset so that all pandas-supported formats can be read while still benefiting from datatable performance (see the sketch below).
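
A rough, hypothetical sketch of that wrapper idea (not part of this PR): load via any pandas-backed dataset and convert the result to a datatable.Frame so downstream processing can benefit from datatable's performance. The dataset and path used here are purely illustrative.

```python
# Hypothetical sketch of the "wrapper" idea above, not part of this PR.
import datatable as dt
from kedro.extras.datasets.pandas import ParquetDataSet  # any pandas.* dataset

pandas_dataset = ParquetDataSet(filepath="data/01_raw/shuttles.parquet")  # illustrative path
frame = dt.Frame(pandas_dataset.load())  # pandas.DataFrame -> datatable.Frame
```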

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@mlisovyi mlisovyi requested a review from idanov as a code owner November 17, 2020 10:16
Contributor

@lucasjamar lucasjamar left a comment


@mlisovyi I think you need to use datatable >=0.11.0 because it was not compatible with Windows prior to this release: https://datatable.readthedocs.io/en/latest/releases/v0.11.0.html
Maybe this is what is causing the Windows pip-compile and unit tests to fail in CircleCI?

@mlisovyi
Author

@lucasjamar good point. Thanks for the suggestion!
We will need to have a look into the datatable version. The 0.11.0 version on Linux gave me a lot of headaches :( When I pip-installed it into the nominal conda environment that had all kedro-requested packages included, the environment went into a weird broken state due to version conflicts and I was not able to run even conda list. Only uninstalling datatable with pip got the environment back into a working state.
More work needs to be invested to understand what goes wrong.

@lucasjamar
Contributor


@mlisovyi I already mentioned this conda issue to datatable. It's a real bummer.

@mlisovyi
Author

The issue with datatable has been fixed on the datatable side, but we need to wait for the next bugfix(?) release (presumably 0.11.1).

@yetudada
Contributor

yetudada commented Dec 1, 2020

Thank you so much @mlisovyi! We'll get this reviewed!

@yetudada yetudada changed the title from "WIP: Add datatable csv dataset (#592)" to "[KED-2254] Add datatable csv dataset (#592)" Dec 1, 2020
Member

@merelcht merelcht left a comment


Thanks for the contribution @mlisovyi! I added a couple of questions about the implementation.

kedro/extras/datasets/datatable/csv_dataset.py (Outdated)

with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
# convert to pandas before saving as otherwise can not use fs_file
data.to_pandas().to_csv(path_or_buf=fs_file, **self._save_args)
Member


Why are you still using pandas to save the data?

Author


Indeed, that is inefficient and weird.
The reason is that datatable doesn't seem to support a file-like object as a target for writing out a table. Therefore one cannot benefit from the access to the various filesystems that fsspec.open() provides.

Member


Right, so the main purpose of this dataset would be that you can load data into datatable and then work with it, before saving it again via pandas? If so, I'd suggest updating the description of the class to make this clear.

Author

@mlisovyi mlisovyi Dec 15, 2020


Yes, correct. The main two benefits would be:

  • speed-up of CSV reading, as datatable allows multi-threaded reading of CSV;
  • usage of a data manipulation API that is familiar to R users, who might not have experience with pandas.

There is a nice list of advantages summarised in the original issue (#592).
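
To illustrate the second point, a minimal sketch (not from this PR; the file path and column names are made up) of the data.table-flavoured syntax that datatable exposes:

```python
# Minimal illustration of datatable's DT[i, j, ...] syntax, familiar from R's
# data.table; the CSV path and column names are purely illustrative.
import datatable as dt
from datatable import f, by, mean

frame = dt.fread("data/01_raw/iris.csv")                 # multi-threaded CSV read
long_sepals = frame[f.sepal_length > 5.0, :]             # row filter, like DT[i, ]
summary = frame[:, mean(f.sepal_length), by(f.species)]  # grouped aggregation
```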

OK, I'll modify the docstring of the class. This and the other proposed change will take a couple of days due to other tasks on my TODO list.

Author


OK, the changes have been pushed.

Contributor


According to Frame.to_csv() docs, "If no path is given, then the Frame will be serialized into a string, and that string will be returned".

So maybe we should just get that string, .encode("utf-8") it, and send it to fsspec instead of converting to pandas?
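
A minimal sketch of what that could look like, reusing the names from the quoted snippet (data, save_path, self._fs); note that self._save_args currently holds pandas to_csv arguments, so it would need remapping to datatable's to_csv signature and is left out here:

```python
# Sketch of the suggestion above, assuming the surrounding _save() method
# provides `data`, `save_path` and `self._fs` as in the quoted code.
csv_string = data.to_csv()  # no path given -> the Frame is serialized into a str
with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(csv_string.encode("utf-8"))
```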

Author


Unfortunately, I do not have enough experience to judge this. I can blindly implement your suggestion, but I will not be able to validate it due to lack of time.

@kedro-org kedro-org deleted a comment from merelcht Dec 14, 2020
Member

@merelcht merelcht left a comment


This looks good to me! 👍 Can you please add your name to the RELEASE.md list of contributors and we will be good to go! 🎉


@@ -5,6 +5,7 @@ behave==1.2.6
biopython~=1.73
black==v19.10.b0
dask[complete]~=2.6
datatable>=0.11.1, <1.0
Contributor


datatable should also be added to extras_require in setup.py.

Author


Good catch! Thanks.

I'm not really familiar with the extras_require configuration for setuptools. I have added the requirements as far as I understand the logic, but could you please check that it is correct? In particular, the cross-dependency between the datatable and pandas requirement entries is not fully clear to me.
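
For reference, a hypothetical sketch of how such an entry might look in setup.py; the actual key names and version pins would have to follow kedro's existing extras_require layout:

```python
# Hypothetical setup.py fragment; the key naming ("datatable.CSVDataSet") mimics
# kedro's per-dataset extras convention and is an assumption, not the final code.
datatable_require = ["datatable>=0.11.1, <1.0"]

extras_require = {
    # ... existing per-dataset extras ...
    "datatable.CSVDataSet": datatable_require,
}
```

That would let users opt in with something like pip install "kedro[datatable.CSVDataSet]".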

mlisovyi and others added 3 commits December 16, 2020 14:41
@mlisovyi
Author

mlisovyi commented Jan 2, 2021

The failures in the py3.7 and py3.8 linting are very unclear to me. After a quick googling, my guess is that they are related to pylint-dev/pylint#3318. I have no clue why it does not appear in the linting check with other Python versions, and my knowledge at this stage does not allow me to pursue it further.

@merelcht
Member

merelcht commented Jan 4, 2021


We're seeing these errors on other builds as well, so they're not related to your PR. I'll investigate further.

@merelcht
Member

Hi @mlisovyi, will you be able to finish this PR, or are you happy for me to jump in and make it ready for merging?

@merelcht
Member

merelcht commented Mar 8, 2021

Hi @mlisovyi, will you be able to address the comments on this PR? We're looking at cleaning up the open PRs, so we'll be deleting this in 2 weeks. Thanks!

@mlisovyi
Author

Hi @MerelTheisenQB,

My apologies, unfortunately I'm currently overwhelmed with tasks at work and I do not see myself finding time in the near future to contribute to the project. So I think this has to be closed to reduce the maintenance burden for the package maintainers. If someone comes back to the original issue and works on it, they can always find the branch in my fork.

Once again my apologies.

@mlisovyi mlisovyi closed this Mar 20, 2021
@merelcht
Member

Thanks for getting back about this @mlisovyi 🙂 No worries at all! Hopefully we'll see you back on the Kedro repo again in the future 🚀

pull bot pushed a commit to vishalbelsare/kedro that referenced this pull request Apr 4, 2021