
remote: hdfs: use pyarrow and connection pool #2297

Merged
merged 1 commit into iterative:master on Jul 19, 2019

Conversation

@efiop efiop (Contributor) commented Jul 19, 2019

Related to #1629

Speeds up HDFS tests significantly. E.g. TestReproExternalHDFS goes from 600 sec to 190 sec (we are still using the hadoop CLI there to get the checksum), and test_open_external[HDFS] from 160 sec to 3 sec. (See the connection-pool sketch after the checklist below.)

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.
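
To make the change concrete: the idea is to keep live pyarrow HDFS connections around and hand them out per operation, instead of reconnecting (or spawning a hadoop CLI subprocess) every time. The sketch below is illustrative only, assuming pyarrow's libhdfs binding of that era; the class and parameter names are hypothetical, not DVC's actual implementation.

```python
import contextlib
import threading

import pyarrow


class HDFSConnectionPool:
    """Illustrative pool: reuse pyarrow HDFS connections across operations."""

    def __init__(self, host, port, user=None):
        self._host, self._port, self._user = host, port, user
        self._conns = []
        self._lock = threading.Lock()

    @contextlib.contextmanager
    def connection(self):
        # Grab an idle connection if one exists, otherwise open a new one.
        with self._lock:
            conn = self._conns.pop() if self._conns else None
        if conn is None:
            conn = pyarrow.hdfs.connect(self._host, self._port, user=self._user)
        try:
            yield conn
        finally:
            # Return the connection for reuse instead of closing it.
            with self._lock:
                self._conns.append(conn)


# Usage (hypothetical host/port):
# pool = HDFSConnectionPool("namenode", 8020, user="dvc")
# with pool.connection() as conn:
#     print(conn.ls("/"))
```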


@efiop efiop requested a review from Suor July 19, 2019 09:22
@efiop efiop force-pushed the hdfs branch 12 times, most recently from effac7f to 70591e8 on July 19, 2019 20:35
@efiop efiop changed the title [WIP] remote: hdfs: use pyarrow and connection pool remote: hdfs: use pyarrow and connection pool Jul 19, 2019
Related to iterative#1629

Speeds up HDFS tests significantly. E.g. ExternalHDFS test goes from
600 sec to 190 sec. test_open_external[HDFS] from 160 sec to 3 sec.

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
@efiop efiop merged commit 3becd5c into iterative:master Jul 19, 2019
@efiop efiop deleted the hdfs branch July 19, 2019 23:15
@@ -65,6 +94,7 @@ def _group(regex, s, gname):
         return match.group(gname)

     def get_file_checksum(self, path_info):
+        # NOTE: pyarrow doesn't support checksum, so we need to use hadoop
Suor (Contributor) commented on the diff:
Depending on how this is implemented, it might be faster to just open the file and compute the checksum ourselves. That would also remove the dependency on the hadoop CLI.
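
A minimal sketch of that idea, streaming the file through a pyarrow HDFS connection and hashing locally (illustrative: the function name and chunk size are hypothetical, and this yields a plain MD5 of the contents, not the HDFS-native composite checksum the hadoop CLI reports):

```python
import hashlib


def hdfs_md5(conn, path, chunk_size=1024 * 1024):
    # Read the file in chunks via pyarrow and hash it locally; no hadoop CLI.
    md5 = hashlib.md5()
    with conn.open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()
```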

efiop (Contributor, Author) replied:
@Suor I agree, but I haven't been able to confirm that yet. We discussed it during our private call 🙂 For now, tests went from 20 minutes to 12 minutes on the py3.7 cron jobs, which is very good. Dropping the hadoop CLI there would make it even better; I will be sure to take a look at it.
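
For reference, the remaining hadoop CLI path being discussed looks roughly like this (a sketch only; DVC's actual command invocation and output parsing differ):

```python
import subprocess


def hadoop_cli_checksum(path):
    # `hadoop fs -checksum` prints something like:
    #   <path>  MD5-of-0MD5-of-512CRC32C  <hex digest>
    # Taking the last field is a simplification for illustration.
    out = subprocess.check_output(["hadoop", "fs", "-checksum", path], text=True)
    return out.split()[-1]
```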

Suor (Contributor) replied:
I also see an .info() method in pyarrow; does it have a checksum there?

efiop (Contributor, Author) replied:
It doesn't.
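
For anyone checking: pyarrow's .info() returns file metadata but no checksum field. A sketch of what it does expose (host, port, and path below are hypothetical):

```python
import pyarrow

conn = pyarrow.hdfs.connect("namenode", 8020)
info = conn.info("/some/path")
# Typical keys include size, kind, owner, group, and timestamps;
# there is no checksum entry.
print(info["size"], info["kind"])
```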

Suor (Contributor) replied:
Out of luck 🙂 We should look into adding checksum support to pyarrow then, or at least open an issue.

@efiop efiop (Contributor, Author) commented Jul 21, 2019

For the record: found a bug and submitted a patch [1] to pyarrow to include openjdk-8 in its automatic search paths, and it got merged, but the next release will be in a few months 🙁 Until then some users will need to set JAVA_HOME, which is something that Java users expect to have to configure (at least sometimes).

[1] apache/arrow#4907

@Suor Suor (Contributor) commented Jul 21, 2019

@efiop we might want to update our docs with JAVA_HOME instructions; for Python users, a "failed loading libjvm" message is unfamiliar territory.
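
Until the patched pyarrow release ships, the workaround is to set JAVA_HOME before pyarrow loads libjvm. A sketch (the JDK path is just a common Debian/Ubuntu openjdk-8 location, not a universal one):

```python
import os

# Must be set before pyarrow dlopens libjvm, i.e. before connecting.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")

import pyarrow

conn = pyarrow.hdfs.connect("namenode", 8020)  # hypothetical host/port
```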

@efiop efiop (Contributor, Author) commented Jul 21, 2019
