[KED-2254] Add datatable csv dataset (#592) #616
Why are you still using `pandas` to save the data?
Indeed, that is inefficient and weird.
The reason is that datatable doesn't seem to accept a file-like object as the target when writing out a table. Therefore one cannot benefit from the access to multiple filesystems that `fsspec.open()` provides.
Right, so the main purpose of this dataset would be that you can load data in `datatable` and then work with it, before saving it again via pandas? If so, I'd suggest updating the description of the class to make this clear.
Yes, correct. The main 2 benefits would be:
- `datatable` allows multi-thread reading from csv;
- [...] `pandas`.

There is a nice list of advantages summarised in the original issue (#592).
OK, I'll modify the docstring of the class. This and the other proposed change will take a couple of days due to other tasks on my TODO.
OK, the changes have been pushed.
According to the `Frame.to_csv()` docs, "If no path is given, then the Frame will be serialized into a string, and that string will be returned".
So maybe we should just get such a string, `.encode("utf-8")` it, and send it to fsspec instead of converting to pandas?
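A minimal sketch of that suggestion, not validated against the dataset in this PR. The helper name `write_frame_as_csv` and the `FakeFrame` stand-in are hypothetical and exist only so the snippet runs without `datatable` installed. The sketch assumes that `to_csv()` called without a path returns the CSV as a string, as the quoted docs say, and that the target is a binary file-like object such as the one produced by `with fsspec.open(path, mode="wb") as fs_file:`.

```python
import io


def write_frame_as_csv(frame, fs_file, encoding="utf-8"):
    """Serialise `frame` to CSV text and write the encoded bytes to `fs_file`.

    `frame` is any object whose `to_csv()` returns a string when called
    without a path. `fs_file` is a binary file-like object, for example the
    one yielded by `with fsspec.open(path, mode="wb") as fs_file:`.
    """
    csv_text = frame.to_csv()  # no path given, so the CSV comes back as a string
    fs_file.write(csv_text.encode(encoding))


class FakeFrame:
    """Hypothetical stand-in for datatable.Frame, used only for this demo."""

    def to_csv(self):
        return "a,b\n1,2\n"


buffer = io.BytesIO()  # stands in for an fsspec file opened in "wb" mode
write_frame_as_csv(FakeFrame(), buffer)
```

This holds the whole CSV in memory as one string before writing it, which may or may not be acceptable for large frames.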
Unfortunately, I do not have enough experience to judge this. I can implement your suggestion blindly, but I will not be able to validate it due to lack of time.
Datatable should also be added to `extras_require` in `setup.py`.
Good catch! Thanks.
I'm not really familiar with the `extras_require` configuration for setuptools. I have added the requirements as far as I understand the logic, but could you please check that it is correct? In particular, the cross-dependency of entries with datatable and pandas requirements is not fully clear to me.
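For reference, one way the cross-dependency could be expressed is sketched below. This is an illustration, not the project's actual `setup.py`: the helper `_collect_requirements`, the key `datatable.CSVDataSet`, and the unpinned requirement strings are all assumptions. The idea is that the datatable extra also lists pandas, because saving currently goes through pandas.

```python
def _collect_requirements(requires):
    """Flatten a {name: [requirements]} mapping into a sorted, de-duplicated list."""
    return sorted(set().union(*requires.values()))


# Unpinned placeholders; the real version specifiers are not given in this thread.
PANDAS = "pandas"
DATATABLE = "datatable"

pandas_require = {"pandas.CSVDataSet": [PANDAS]}
# Hypothetical dataset key. pandas is listed as well because the dataset
# converts to pandas when saving.
datatable_require = {"datatable.CSVDataSet": [DATATABLE, PANDAS]}

extras_require = {
    "pandas": _collect_requirements(pandas_require),
    "datatable": _collect_requirements(datatable_require),
    **pandas_require,
    **datatable_require,
}
# An "all" extra that pulls in every optional requirement listed above.
extras_require["all"] = _collect_requirements(extras_require)
```

The resulting dictionary would then be passed to `setup(..., extras_require=extras_require)`, so that installing the `datatable` extra also installs pandas.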