Support working_dir in from_spark #5826

maddiedawson · 2023-05-05T20:22:40Z

Accept working_dir as an argument to Dataset.from_spark. Letting Spark workers materialize data into a non-NFS working directory should improve write performance.
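
To make the idea concrete, here is a minimal sketch of the "stage locally, then move" pattern the description relies on. It is not the code from this PR: the helper name `write_via_working_dir`, the shard file name, and the byte payload are all hypothetical, and the real builder writes Arrow shards from Spark tasks rather than raw bytes.

```python
import os
import shutil
import tempfile
from typing import Optional


def write_via_working_dir(data: bytes, fpath: str, working_dir: Optional[str] = None) -> str:
    """Write `data` to `fpath`, staging it in `working_dir` first when one is given.

    Hypothetical helper for illustration only; not part of the datasets library.
    """
    if working_dir is None:
        # No staging directory: write straight to the destination.
        with open(fpath, "wb") as f:
            f.write(data)
        return fpath

    # Stage the file on the (presumably faster, local) working directory.
    os.makedirs(working_dir, exist_ok=True)
    staged = os.path.join(working_dir, os.path.basename(fpath))
    with open(staged, "wb") as f:
        f.write(data)

    # Move the finished file to its final location in one step.
    # shutil.move falls back to copy + delete when the paths are on different filesystems.
    os.makedirs(os.path.dirname(fpath) or ".", exist_ok=True)
    shutil.move(staged, fpath)
    return fpath
```

With the change proposed here, a caller would presumably pass the staging location directly, for example `Dataset.from_spark(df, working_dir="/tmp/spark_work")`, where the path is only an illustration.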

HuggingFaceDocBuilderDev · 2023-05-05T20:27:24Z

The documentation is not available anymore as the PR was closed or merged.

src/datasets/packaged_modules/spark/spark.py

lu-wang-dl

Could we also support setting the working directory through an environment variable?

src/datasets/packaged_modules/spark/spark.py

maddiedawson · 2023-05-22T20:19:55Z

Added env var

maddiedawson · 2023-05-23T23:08:29Z

@lhoestq would you or another maintainer be able to review please? :)

lhoestq

Nice! I don't think HF_WORKING_DIR is a good env variable name for this case, so I guess you can remove it.

src/datasets/packaged_modules/spark/spark.py

lhoestq · 2023-05-24T14:46:56Z

src/datasets/packaged_modules/spark/spark.py

@@ -221,6 +255,7 @@ def _prepare_split(
        self._validate_cache_dir()

        max_shard_size = convert_file_size_to_int(max_shard_size or MAX_SHARD_SIZE)
+        self._repartition_df_if_needed(max_shard_size)
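
For readers wondering what a size-based repartition check might compute, the sketch below shows one plausible rule: leave the DataFrame alone when its estimated size already fits in one shard, otherwise aim for roughly one partition per `max_shard_size` bytes, capped by the row count. The function name `target_num_partitions` and the exact rounding are assumptions for illustration; the hunk above does not show the body of `_repartition_df_if_needed`.

```python
import math


def target_num_partitions(
    approx_total_bytes: int,
    max_shard_size: int,
    num_rows: int,
    current_partitions: int,
) -> int:
    """Pick a partition count so each partition is at most about `max_shard_size` bytes.

    Hypothetical helper; the sizing rule is an assumption, not the PR's implementation.
    """
    if approx_total_bytes <= max_shard_size:
        # Everything fits in a single shard: keep the existing partitioning.
        return current_partitions
    wanted = math.ceil(approx_total_bytes / max_shard_size)
    # A partition needs at least one row, so never ask for more partitions than rows.
    return max(1, min(num_rows, wanted))
```

On a Spark DataFrame the result would then be applied with `df.repartition(n)`.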


maddiedawson · 2023-05-24T20:27:00Z

I removed the env var

lhoestq

LGTM :)

github-actions · 2023-05-25T08:59:06Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005771 / 0.011353 (-0.005582)	0.004086 / 0.011008 (-0.006922)	0.097170 / 0.038508 (0.058661)	0.027464 / 0.023109 (0.004355)	0.305425 / 0.275898 (0.029527)	0.343869 / 0.323480 (0.020389)	0.004899 / 0.007986 (-0.003087)	0.003294 / 0.004328 (-0.001034)	0.074710 / 0.004250 (0.070459)	0.034982 / 0.037052 (-0.002070)	0.306063 / 0.258489 (0.047574)	0.343115 / 0.293841 (0.049274)	0.025155 / 0.128546 (-0.103392)	0.008429 / 0.075646 (-0.067217)	0.318680 / 0.419271 (-0.100591)	0.043304 / 0.043533 (-0.000229)	0.306703 / 0.255139 (0.051564)	0.335535 / 0.283200 (0.052335)	0.087428 / 0.141683 (-0.054255)	1.483769 / 1.452155 (0.031614)	1.538753 / 1.492716 (0.046037)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.203313 / 0.018006 (0.185307)	0.413864 / 0.000490 (0.413375)	0.003186 / 0.000200 (0.002986)	0.000068 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022862 / 0.037411 (-0.014550)	0.097306 / 0.014526 (0.082780)	0.102823 / 0.176557 (-0.073733)	0.162803 / 0.737135 (-0.574333)	0.106311 / 0.296338 (-0.190028)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.451710 / 0.215209 (0.236501)	4.508520 / 2.077655 (2.430865)	2.181118 / 1.504120 (0.676998)	1.977607 / 1.541195 (0.436412)	2.008366 / 1.468490 (0.539876)	0.565388 / 4.584777 (-4.019389)	3.439318 / 3.745712 (-0.306394)	1.747512 / 5.269862 (-3.522349)	1.102124 / 4.565676 (-3.463553)	0.069212 / 0.424275 (-0.355063)	0.011926 / 0.007607 (0.004318)	0.553414 / 0.226044 (0.327370)	5.548959 / 2.268929 (3.280031)	2.628769 / 55.444624 (-52.815856)	2.301003 / 6.876477 (-4.575473)	2.341744 / 2.142072 (0.199672)	0.673092 / 4.805227 (-4.132135)	0.137722 / 6.500664 (-6.362942)	0.066909 / 0.075469 (-0.008560)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.196854 / 1.841788 (-0.644934)	13.421776 / 8.074308 (5.347468)	13.839760 / 10.191392 (3.648368)	0.140557 / 0.680424 (-0.539867)	0.016619 / 0.534201 (-0.517582)	0.357985 / 0.579283 (-0.221298)	0.387018 / 0.434364 (-0.047346)	0.452798 / 0.540337 (-0.087540)	0.542085 / 1.386936 (-0.844851)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005868 / 0.011353 (-0.005484)	0.004103 / 0.011008 (-0.006905)	0.076126 / 0.038508 (0.037618)	0.027744 / 0.023109 (0.004635)	0.357257 / 0.275898 (0.081359)	0.387981 / 0.323480 (0.064501)	0.004807 / 0.007986 (-0.003178)	0.003337 / 0.004328 (-0.000991)	0.075486 / 0.004250 (0.071236)	0.035121 / 0.037052 (-0.001931)	0.361385 / 0.258489 (0.102896)	0.399346 / 0.293841 (0.105505)	0.025263 / 0.128546 (-0.103284)	0.008571 / 0.075646 (-0.067075)	0.081815 / 0.419271 (-0.337457)	0.041114 / 0.043533 (-0.002418)	0.362840 / 0.255139 (0.107701)	0.380926 / 0.283200 (0.097727)	0.092728 / 0.141683 (-0.048955)	1.517647 / 1.452155 (0.065492)	1.534914 / 1.492716 (0.042198)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199669 / 0.018006 (0.181663)	0.399070 / 0.000490 (0.398580)	0.002014 / 0.000200 (0.001814)	0.000079 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024541 / 0.037411 (-0.012870)	0.099676 / 0.014526 (0.085151)	0.106503 / 0.176557 (-0.070054)	0.153755 / 0.737135 (-0.583380)	0.108564 / 0.296338 (-0.187775)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.443842 / 0.215209 (0.228633)	4.441158 / 2.077655 (2.363503)	2.159496 / 1.504120 (0.655376)	1.955358 / 1.541195 (0.414163)	1.973864 / 1.468490 (0.505374)	0.550467 / 4.584777 (-4.034310)	3.381831 / 3.745712 (-0.363881)	2.561192 / 5.269862 (-2.708670)	1.361684 / 4.565676 (-3.203992)	0.068140 / 0.424275 (-0.356135)	0.012005 / 0.007607 (0.004398)	0.551921 / 0.226044 (0.325877)	5.503591 / 2.268929 (3.234662)	2.591609 / 55.444624 (-52.853015)	2.246681 / 6.876477 (-4.629796)	2.290941 / 2.142072 (0.148868)	0.655212 / 4.805227 (-4.150015)	0.136013 / 6.500664 (-6.364651)	0.066995 / 0.075469 (-0.008474)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.300438 / 1.841788 (-0.541350)	13.866224 / 8.074308 (5.791916)	13.932624 / 10.191392 (3.741232)	0.144345 / 0.680424 (-0.536079)	0.016623 / 0.534201 (-0.517578)	0.357629 / 0.579283 (-0.221654)	0.389759 / 0.434364 (-0.044605)	0.417704 / 0.540337 (-0.122633)	0.501358 / 1.386936 (-0.885578)

maddiedawson · 2023-05-25T17:45:51Z

Thank you!

maddiedawson changed the title from "[WIP] Support working_dir in from_spark" to "Support working_dir in from_spark" May 5, 2023

lu-wang-dl reviewed May 8, 2023

src/datasets/packaged_modules/spark/spark.py (outdated)

lu-wang-dl reviewed May 8, 2023

src/datasets/packaged_modules/spark/spark.py

maddiedawson force-pushed the working_dir branch from 051a1b0 to 35c13d1 May 10, 2023 05:51

maddiedawson requested a review from lu-wang-dl May 10, 2023 05:51

maddiedawson force-pushed the working_dir branch from 35c13d1 to 1b5f131 May 10, 2023 05:52

lu-wang-dl reviewed May 17, 2023

src/datasets/packaged_modules/spark/spark.py

maddiedawson added 4 commits May 22, 2023 12:14

Init

53b5a8c

Add _repartition_df_if_needed

c298d1b

Write file to working_dir then rename to fpath

c2f28db

Fix sampling and add unit tests

ccf5ba8

maddiedawson force-pushed the working_dir branch 2 times, most recently from 00ff574 to 347580f May 22, 2023 20:19

maddiedawson requested a review from lu-wang-dl May 22, 2023 20:19

maddiedawson force-pushed the working_dir branch 2 times, most recently from 86d4429 to b7fdc35 May 22, 2023 20:41

Address comments

ac3e4ec

maddiedawson force-pushed the working_dir branch from b7fdc35 to ac3e4ec May 22, 2023 20:52

lu-wang-dl approved these changes May 24, 2023

lhoestq reviewed May 24, 2023

Address comments

38f85c1

maddiedawson requested a review from lhoestq May 24, 2023 20:26

lhoestq approved these changes May 25, 2023

lhoestq merged commit 89f7752 into huggingface:main May 25, 2023

maddiedawson deleted the working_dir branch May 25, 2023 17:45
