
[KED-2254] Add datatable csv dataset (#592) #616

Closed

Conversation

@mlisovyi mlisovyi commented Nov 17, 2020

Description

Adds a dataset to read/write CSV using datatable instead of pandas.
Resolves #592
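
A minimal usage sketch of the new dataset; the class name CSVDataSet and the filepath argument are assumptions based on the file added in this PR (kedro/extras/datasets/datatable/csv_dataset.py) and kedro's usual dataset conventions, so the final signature may differ:

```python
# Hypothetical usage sketch; class name and arguments follow kedro's usual
# dataset conventions and may differ from the final implementation in this PR.
from kedro.extras.datasets.datatable import CSVDataSet

data_set = CSVDataSet(filepath="data/01_raw/iris.csv")  # path is illustrative
frame = data_set.load()   # returns a datatable.Frame instead of a pandas.DataFrame
data_set.save(frame)      # writes the Frame back out as CSV
```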

Development notes

I have added only CSV, and not Excel, as a dataset. The latter causes a lot of problems when operating with fsspec. And when I try to use the file name directly, I still fail the basic tests: for some reason a write -> read round trip on an Excel file does not yield the same column data types in the resulting datatable.Frame as in the original datatable.Frame.

To be discussed:
What is the purpose of this add-on? While implementing it, I realised that:

  • one gets a datatable.Frame object instead of a pandas.DataFrame:
    • for R users who are familiar with data.table in R, this is beneficial, as it provides a familiar API instead of the pandas API
    • but it works only with CSV as input
    • but it can be confusing that a CSVDataset with merely a different prefix returns an object with a very different API
  • one gets the performance gains of datatable over pandas:
    • multi-threaded reading of CSV
    • optimised string storage, i.e. a smaller memory footprint than pandas object columns
    • out-of-core computation (one can in principle process data that is larger than RAM)
    • but only for CSV input; maybe it would make more sense to write a wrapper around pandas.XXXDataset so that all pandas-supported formats can be read while still benefiting from datatable performance (see the sketch below).
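
A rough, hypothetical sketch of that wrapper idea (not part of this PR): load via any pandas-backed dataset and convert the result to a datatable.Frame so downstream processing can benefit from datatable's performance. The dataset and path used here are purely illustrative.

```python
# Hypothetical sketch of the "wrapper" idea above, not part of this PR.
import datatable as dt
from kedro.extras.datasets.pandas import ParquetDataSet  # any pandas.* dataset

pandas_dataset = ParquetDataSet(filepath="data/01_raw/shuttles.parquet")  # illustrative path
frame = dt.Frame(pandas_dataset.load())  # pandas.DataFrame -> datatable.Frame
```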

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@mlisovyi mlisovyi requested a review from idanov as a code owner November 17, 2020 10:16
Contributor

@lucasjamar lucasjamar left a comment


@mlisovyi I think you need to use datatable >=0.11.0 because it was not compatible with Windows prior to this release: https://datatable.readthedocs.io/en/latest/releases/v0.11.0.html
Maybe this is what is causing the Windows pip-compile and unit tests to fail in CircleCI?

@mlisovyi
Author

@lucasjamar good point. Thanks for the suggestion!
We will need to have a look into the datatable version. The 0.11.0 version on Linux gave me a lot of headaches :( When I pip-installed it into the nominal conda environment that had all kedro-requested packages included, the environment went into a weird broken state due to version conflicts and I was not able to run even conda list. Only uninstalling datatable with pip got the environment back into a working state.
More work needs to be invested to understand what goes wrong.

@lucasjamar
Contributor


@mlisovyi I already mentioned this conda issue to datatable. It's a real bummer.

@mlisovyi
Author

The issue with datatable has been fixed on the datatable side, but we need to wait for the next bugfix(?) release (presumably 0.11.1).

@yetudada
Contributor

yetudada commented Dec 1, 2020

Thank you so much @mlisovyi! We'll get this reviewed!

@yetudada yetudada changed the title from "WIP: Add datatable csv dataset (#592)" to "[KED-2254] Add datatable csv dataset (#592)" Dec 1, 2020
Member

@merelcht merelcht left a comment


Thanks for the contribution @mlisovyi! I added a couple of questions about the implementation.

kedro/extras/datasets/datatable/csv_dataset.py (Outdated)

with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
# convert to pandas before saving as otherwise can not use fs_file
data.to_pandas().to_csv(path_or_buf=fs_file, **self._save_args)
Member


Why are you still using pandas to save the data?

Author


Indeed, that is inefficient and weird.
The reason is that datatable doesn't seem to support a file-like object as a target for writing out a table. Therefore one cannot benefit from the access to the various filesystems that fsspec.open() provides.

Member


Right, so the main purpose of this dataset would be that you can load data into datatable and then work with it, before saving it again via pandas? If so, I'd suggest updating the description of the class to make this clear.

Author

@mlisovyi mlisovyi Dec 15, 2020


Yes, correct. The main two benefits would be:

  • speed-up of CSV reading, as datatable allows multi-threaded reading of CSV;
  • usage of a data manipulation API that is familiar to R users, who might not have experience with pandas.

There is a nice list of advantages summarised in the original issue (#592).
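
To illustrate the second point, a minimal sketch (not from this PR; the file path and column names are made up) of the data.table-flavoured syntax that datatable exposes:

```python
# Minimal illustration of datatable's DT[i, j, ...] syntax, familiar from R's
# data.table; the CSV path and column names are purely illustrative.
import datatable as dt
from datatable import f, by, mean

frame = dt.fread("data/01_raw/iris.csv")                 # multi-threaded CSV read
long_sepals = frame[f.sepal_length > 5.0, :]             # row filter, like DT[i, ]
summary = frame[:, mean(f.sepal_length), by(f.species)]  # grouped aggregation
```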

OK, I'll modify the docstring of the class. This and the other proposed change will take a couple of days due to other tasks on my TODO list.

Author


OK, the changes have been pushed.

Contributor


According to Frame.to_csv() docs, "If no path is given, then the Frame will be serialized into a string, and that string will be returned".

So maybe we should just get that string, .encode("utf-8") it, and send it to fsspec instead of converting to pandas?
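
A minimal sketch of what that could look like, reusing the names from the quoted snippet (data, save_path, self._fs); note that self._save_args currently holds pandas to_csv arguments, so it would need remapping to datatable's to_csv signature and is left out here:

```python
# Sketch of the suggestion above, assuming the surrounding _save() method
# provides `data`, `save_path` and `self._fs` as in the quoted code.
csv_string = data.to_csv()  # no path given -> the Frame is serialized into a str
with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(csv_string.encode("utf-8"))
```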

Author


Unfortunately, I do not have enough experience to judge this. I can blindly implement your suggestion, but I will not be able to validate it due to lack of time.

@kedro-org kedro-org deleted a comment from merelcht Dec 14, 2020
Member

@merelcht merelcht left a comment


This looks good to me! 👍 Can you please add your name to the RELEASE.md list of contributors and we will be good to go! 🎉


@@ -5,6 +5,7 @@ behave==1.2.6
biopython~=1.73
black==v19.10.b0
dask[complete]~=2.6
datatable>=0.11.1, <1.0
Contributor


datatable should also be added to extras_require in setup.py.

Author


Good catch! Thanks.

I'm not really familiar with the extras_require configuration for setuptools. I have added the requirements as far as I understand the logic, but could you please check that it is correct? In particular, the cross-dependency between the datatable and pandas requirement entries is not fully clear to me.
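
For reference, a hypothetical sketch of how such an entry might look in setup.py; the actual key names and version pins would have to follow kedro's existing extras_require layout:

```python
# Hypothetical setup.py fragment; the key naming ("datatable.CSVDataSet") mimics
# kedro's per-dataset extras convention and is an assumption, not the final code.
datatable_require = ["datatable>=0.11.1, <1.0"]

extras_require = {
    # ... existing per-dataset extras ...
    "datatable.CSVDataSet": datatable_require,
}
```

That would let users opt in with something like pip install "kedro[datatable.CSVDataSet]".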

mlisovyi and others added 3 commits December 16, 2020 14:41
@mlisovyi
Author

mlisovyi commented Jan 2, 2021

The failures in the py3.7 and py3.8 linting are very unclear to me. After a quick googling, my guess is that they are related to pylint-dev/pylint#3318. I have no clue why it does not appear in the linting check with other Python versions, and my knowledge at this stage does not allow me to pursue it further.

@merelcht
Member

merelcht commented Jan 4, 2021


We're seeing these errors on other builds as well, so they're not related to your PR. I'll investigate further.

@merelcht
Member

Hi @mlisovyi, will you be able to finish this PR, or are you happy for me to jump in and make it ready for merging?

@merelcht
Member

merelcht commented Mar 8, 2021

Hi @mlisovyi, will you be able to address the comments on this PR? We're looking at cleaning up the open PRs, so we'll be deleting this in 2 weeks. Thanks!

@mlisovyi
Author

Hi @MerelTheisenQB,

My apologies, unfortunately I'm currently overwhelmed with tasks at work and I do not see myself finding time in the near future to contribute to the project. So I think this has to be closed to reduce the maintenance burden for the package maintainers. If someone comes back to the original issue and works on it, they can always find the branch in my fork.

Once again my apologies.

@mlisovyi mlisovyi closed this Mar 20, 2021
@merelcht
Member

Thanks for getting back about this @mlisovyi 🙂 No worries at all! Hopefully we'll see you back on the Kedro repo again in the future 🚀

pull bot pushed a commit to vishalbelsare/kedro that referenced this pull request Apr 4, 2021