[KED-2254] Add datatable csv dataset (#592) #616
Conversation
@mlisovyi I think you need to use datatable >=0.11.0, because it was not compatible with Windows prior to that release: https://datatable.readthedocs.io/en/latest/releases/v0.11.0.html
Maybe this is what is causing the Windows pip-compile and unit tests to fail in CircleCI?
@lucasjamar good point. Thanks for the suggestion!
@mlisovyi I already mentioned this conda issue to datatable. It's a real bummer.
The issue with datatable has been fixed on the datatable side, but we need to wait for the next bugfix(?) release (supposedly 0.11.1).
Thank you so much @mlisovyi! We'll get this reviewed!
Thanks for the contribution @mlisovyi! I added a couple of questions about the implementation.
with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
    # convert to pandas before saving as otherwise cannot use fs_file
    data.to_pandas().to_csv(path_or_buf=fs_file, **self._save_args)
Why are you still using pandas to save the data?
Indeed, that is inefficient and weird.
The reason is that datatable doesn't seem to support a file-like object as the target when writing out a table. Therefore one cannot benefit from the access to various filesystems that fsspec.open() provides.
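To illustrate the limitation (a sketch, not code from this PR; the frame and buffer names are made up): pandas can write into any file-like object, which is what fsspec.open() hands back, whereas datatable's Frame.to_csv() expects a file path.

import io
import datatable as dt

frame = dt.Frame({"a": [1, 2, 3]})
buffer = io.StringIO()

# pandas happily writes into an in-memory (or fsspec) file object
frame.to_pandas().to_csv(path_or_buf=buffer, index=False)

# datatable only writes to a path; passing `buffer` here would fail
frame.to_csv("local_file.csv")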
Right, so the main purpose of this dataset would be that you can load data in datatable and then work with it, before saving it again as pandas? If so, I'd suggest updating the description of the class to make this clear.
Yes, correct. The main two benefits would be:
- speed-up of CSV reading, as datatable allows multi-threaded reading from CSV;
- usage of a data manipulation API that is familiar to R users, who might not have experience with pandas (a short sketch of both is shown below).
There is a nice list of advantages summarised in the original issue (#592).
Ok, I'll modify the docstring of the class. This and the other proposed change will take a couple of days, due to other tasks on my TODO list.
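To make those benefits concrete, here is a small, hypothetical illustration of the datatable API (not code from this PR; the column names are made up):

import datatable as dt
from datatable import f

# dt.fread("big_file.csv") would parse a CSV using multiple threads;
# a tiny Frame is built inline here so the snippet is self-contained.
frame = dt.Frame({"species": ["cat", "dog", "cat"], "weight": [4.2, 9.5, 3.8]})

# data.table-style frame[filter, select, by(...)] syntax familiar to R users
summary = frame[:, dt.mean(f.weight), dt.by(f.species)]
print(summary)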
OK, the changes have been pushed.
According to the Frame.to_csv() docs, "If no path is given, then the Frame will be serialized into a string, and that string will be returned".
So maybe we should just get such a string, .encode("utf-8") it and send it to fsspec, instead of converting to pandas?
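A rough sketch of that suggestion (untested; save_path, _fs and the binary write mode are assumptions, and the method is meant to live inside the dataset class quoted above):

import datatable as dt

def _save(self, data: dt.Frame) -> None:
    save_path = self._get_save_path()  # assumed helper on the dataset class
    # With no path argument, Frame.to_csv() serialises the Frame into a string...
    csv_string = data.to_csv()
    # ...which can then be encoded and written through fsspec without going via pandas.
    # ("wb" is assumed here so the encoded bytes can be written directly.)
    with self._fs.open(save_path, mode="wb") as fs_file:
        fs_file.write(csv_string.encode("utf-8"))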
Unfortunately, I do not have enough experience to judge. I can blindly implement your suggestion, but I will not be able to validate it due to lack of time.
This looks good to me! 👍 Can you please add your name to the RELEASE.md list of contributors and we will be good to go! 🎉
@@ -5,6 +5,7 @@ behave==1.2.6
 biopython~=1.73
 black==v19.10.b0
 dask[complete]~=2.6
+datatable>=0.11.1, <1.0
Datatable should also be added into extras_require in setup.py.
Good catch! Thanks.
I'm not really familiar with the extras_require configuration for setuptools. I have added requirements as far as I understand the logic, but could you please check that it is correct? In particular, the cross-dependency between the datatable and pandas requirement entries is not fully clear to me.
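For reference, a generic extras_require entry for setuptools looks roughly like this (a sketch only; the extras group name and the pandas pin are assumptions, not taken from kedro's actual setup.py):

# setup.py (sketch)
from setuptools import setup

setup(
    # ... name, version, install_requires, etc. ...
    extras_require={
        # hypothetical extras group for the new dataset; pandas is listed as well
        # because the save path converts the Frame to pandas before writing
        "datatable.CSVDataSet": ["datatable>=0.11.1, <1.0", "pandas>=0.24"],
    },
)

Installing the package with that extra requested would then pull in both dependencies.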
Co-authored-by: Dmitrii Deriabin <44967953+DmitriiDeriabinQB@users.noreply.github.com>
…isovyi/kedro into feature/add-datatable-dataset
The failures in the py3.7 and py3.8 linting are very unclear to me. After some quick googling, my guess is that they are related to pylint-dev/pylint#3318. I have no clue why they do not appear in the linting check with other Python versions, and my knowledge at this stage does not allow me to pursue it further.
We're seeing these errors on other builds as well, so they're not related to your PR. I'll investigate further.
Hi @mlisovyi, will you be able to finish this PR, or are you happy with me jumping in and making it ready for merging?
Hi @mlisovyi, will you be able to address the comments on this PR? We're looking at cleaning up the open PRs, so we'll be deleting this in 2 weeks. Thanks!
Hi @MerelTheisenQB, my apologies: unfortunately I'm currently overwhelmed with tasks at work and I don't see myself finding time in the near future to contribute to the project. So I think this has to be closed to reduce the maintenance burden for the package maintainers. If someone comes back to the original issue and works on it, they can always find the branch in my fork. Once again, my apologies.
Thanks for getting back about this @mlisovyi 🙂 No worries at all! Hopefully we'll see you back on the Kedro repo again in the future 🚀
Description
Adding a dataset to read/write CSV using datatable instead of pandas.
Resolves #592
Development notes
I have added only CSV and not Excel as a dataset. The latter generates a lot of problems when operating with fsspec. And when I try to use the file name directly, I anyway fail the basic tests, as for some reason a write -> read sequence on an Excel file does not yield the same column data types in the datatable.Frame as in the original datatable.Frame.
To be discussed:
What is the purpose of this add-on? While implementing it, I realised that:
- one gets a datatable.Frame object instead of a pandas.DataFrame: for data.table users coming from R this is beneficial, as it provides a familiar API instead of the pandas API;
- the handling of the object column type;
- pandas.XXXDataset could instead be extended to be able to read all pandas-supported formats and then benefit from datatable performance.
Checklist
- Added my name to the list of contributors in the RELEASE.md file
Notice
I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.