remote: hdfs: use pyarrow and connection pool #2297
Conversation
@@ -65,6 +94,7 @@ def _group(regex, s, gname):
        return match.group(gname)

    def get_file_checksum(self, path_info):
        # NOTE: pyarrow doesn't support checksum, so we need to use hadoop
Depending on how this is implemented, it might be faster to just open the file and calculate the checksum yourself. That would also remove the dependency on the hadoop CLI.
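The suggestion above could be sketched as follows. This is a hypothetical helper, not DVC's implementation: it streams a file-like object in chunks and hashes it client-side. Note that MD5 is chosen only for illustration; `hadoop fs -checksum` reports a CRC-based composite checksum (MD5-of-MD5-of-CRC32C), so a plain content hash like this would not match the hadoop CLI's output and would be a different checksum scheme.

```python
import hashlib


def file_md5(fobj, chunk_size=1024 * 1024):
    """Compute an MD5 digest by streaming a file-like object in chunks.

    A pyarrow HDFS file handle (e.g. from hdfs.open(path, "rb")) could be
    passed here just like a local file object, avoiding the hadoop CLI.
    """
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b""
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Since the result differs from hadoop's native checksum, checksums computed the two ways would not be interchangeable in a cache.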
@Suor I agree, but I haven't been able to confirm that yet. We discussed it during our private call 🙂 For now, tests went from 20 minutes to 12 minutes on cron jobs on py3.7, which is very good. Dropping the hadoop CLI would make it even better; I will be sure to take a look at it.
I also see an .info() method in pyarrow; does it have a checksum there?
It doesn't.
Out of luck 🙂 We should look into adding checksum support to pyarrow then, or at least open an issue.
For the record: I found a bug and submitted a patch [1] to pyarrow to include openjdk-8 in the automatic search paths, and it got merged, but the release will be in a few months 🙁 Until then some users will need to set JAVA_HOME, which is something Java users expect to have to configure (at least sometimes).
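Until a pyarrow release includes that patch, affected users would set JAVA_HOME themselves before using the HDFS remote. A typical setup might look like the following; the exact path is illustrative and depends on the distro and JDK package (this example assumes Debian/Ubuntu's openjdk-8 layout):

```shell
# Illustrative only: point pyarrow's libhdfs/JVM discovery at a JDK install.
# Adjust the path for your distro and JDK package.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```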
@efiop we might want to update our docs with
For the record, added iterative/dvc.org#493 and https://issues.apache.org/jira/browse/ARROW-5995
Related to #1629
Speeds up HDFS tests significantly. E.g.
TestReproExternalHDFS from 600sec to 190sec (we are still using hadoop cli there to get checksum).
test_open_external[HDFS] from 160sec to 3sec.
Signed-off-by: Ruslan Kuprieiev ruslan@iterative.ai
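The "connection pool" half of the PR title can be illustrated with a minimal sketch. This is a hypothetical pool, not DVC's actual implementation: a bounded set of clients is created lazily and reused across operations, so each HDFS call borrows a live connection instead of paying JVM/connection startup cost every time, which is where much of the test speedup comes from.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Minimal bounded connection pool: create lazily, reuse, block when full."""

    def __init__(self, factory, size=4):
        # factory is any zero-arg callable that opens a new connection,
        # e.g. lambda: pyarrow.hdfs.connect(host, port) in the HDFS case.
        self._factory = factory
        self._pool = queue.Queue(maxsize=size)
        self._created = 0
        self._size = size

    def _acquire(self):
        try:
            # Prefer an idle, already-open connection.
            return self._pool.get_nowait()
        except queue.Empty:
            if self._created < self._size:
                self._created += 1
                return self._factory()
            # All connections are in use; wait for one to be released.
            return self._pool.get()

    @contextmanager
    def connection(self):
        conn = self._acquire()
        try:
            yield conn
        finally:
            # Return the connection for reuse instead of closing it.
            self._pool.put(conn)
```

Usage would look like `with pool.connection() as conn: conn.ls(path)` — repeated calls reuse the same underlying client rather than reconnecting.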
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.