remote: hdfs: use pyarrow and connection pool #2297
Conversation
@@ -65,6 +94,7 @@ def _group(regex, s, gname):
        return match.group(gname)

    def get_file_checksum(self, path_info):
        # NOTE: pyarrow doesn't support checksum, so we need to use hadoop
Depending on how this is implemented, it might be faster to just open the file and calculate the checksum yourself. That would also remove the dependency on the hadoop CLI.
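The suggestion above could be sketched as follows. This is a hypothetical helper, not DVC's implementation: it streams a file-like object in chunks and hashes it client-side. Note that MD5 is chosen only for illustration; `hadoop fs -checksum` reports a CRC-based composite checksum (MD5-of-MD5-of-CRC32C), so a plain content hash like this would not match the hadoop CLI's output and would be a different checksum scheme.

```python
import hashlib


def file_md5(fobj, chunk_size=1024 * 1024):
    """Compute an MD5 digest by streaming a file-like object in chunks.

    A pyarrow HDFS file handle (e.g. from hdfs.open(path, "rb")) could be
    passed here just like a local file object, avoiding the hadoop CLI.
    """
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b""
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Since the result differs from hadoop's native checksum, checksums computed the two ways would not be interchangeable in a cache.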
@Suor I agree, but I haven't been able to confirm that yet. We discussed it during our private call 🙂 For now, tests went from 20 minutes to 12 minutes on cron jobs on py3.7, which is very good. Dropping the hadoop CLI would make it even better; I will be sure to take a look at it.
I also see an .info() method in pyarrow; does it have a checksum there?
It doesn't.
Out of luck 🙂 We should look into adding checksum support to pyarrow then, or at least open an issue.
For the record: I found a bug and submitted a patch [1] to pyarrow to include openjdk-8 in the automatic search paths, and it got merged, but the release will be in a few months 🙁 Until then some users will need to set JAVA_HOME, which is something Java users expect to have to configure (at least sometimes).
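Until a pyarrow release includes that patch, affected users would set JAVA_HOME themselves before using the HDFS remote. A typical setup might look like the following; the exact path is illustrative and depends on the distro and JDK package (this example assumes Debian/Ubuntu's openjdk-8 layout):

```shell
# Illustrative only: point pyarrow's libhdfs/JVM discovery at a JDK install.
# Adjust the path for your distro and JDK package.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```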
@efiop we might want to update our docs with
For the record, added iterative/dvc.org#493 and https://issues.apache.org/jira/browse/ARROW-5995
Related to #1629
Speeds up HDFS tests significantly. E.g.
TestReproExternalHDFS from 600sec to 190sec (we are still using hadoop cli there to get checksum).
test_open_external[HDFS] from 160sec to 3sec.
Signed-off-by: Ruslan Kuprieiev ruslan@iterative.ai
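The "connection pool" half of the PR title can be illustrated with a minimal sketch. This is a hypothetical pool, not DVC's actual implementation: a bounded set of clients is created lazily and reused across operations, so each HDFS call borrows a live connection instead of paying JVM/connection startup cost every time, which is where much of the test speedup comes from.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Minimal bounded connection pool: create lazily, reuse, block when full."""

    def __init__(self, factory, size=4):
        # factory is any zero-arg callable that opens a new connection,
        # e.g. lambda: pyarrow.hdfs.connect(host, port) in the HDFS case.
        self._factory = factory
        self._pool = queue.Queue(maxsize=size)
        self._created = 0
        self._size = size

    def _acquire(self):
        try:
            # Prefer an idle, already-open connection.
            return self._pool.get_nowait()
        except queue.Empty:
            if self._created < self._size:
                self._created += 1
                return self._factory()
            # All connections are in use; wait for one to be released.
            return self._pool.get()

    @contextmanager
    def connection(self):
        conn = self._acquire()
        try:
            yield conn
        finally:
            # Return the connection for reuse instead of closing it.
            self._pool.put(conn)
```

Usage would look like `with pool.connection() as conn: conn.ls(path)` — repeated calls reuse the same underlying client rather than reconnecting.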
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.