Swap log messages for symbolic/hard links in tar extractor #5452

albertvillanova · 2023-01-23T07:53:38Z

The log messages for symbolic and hard links in the tar extractor do not match their `if` conditions. This PR swaps them.
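For illustration, here is a minimal sketch of the intended pairing between condition and message, using only the standard-library `tarfile` API. The helper names `link_kind` and `log_blocked_link` and the message wording are hypothetical, not the actual code touched by this PR.

```python
import logging
import tarfile
from typing import Optional

logger = logging.getLogger(__name__)


def link_kind(member: tarfile.TarInfo) -> Optional[str]:
    """Return a label for link members (hypothetical helper, for illustration)."""
    # TarInfo.issym() is True for symbolic links (type SYMTYPE).
    if member.issym():
        return "Symlink"
    # TarInfo.islnk() is True for hard links (type LNKTYPE).
    if member.islnk():
        return "Hard link"
    return None


def log_blocked_link(member: tarfile.TarInfo) -> Optional[str]:
    """Log a message whose wording matches the condition that triggered it."""
    kind = link_kind(member)
    if kind is None:
        return None
    # Message wording is an assumption, not the library's exact text.
    message = f"Extraction of {member.name} is blocked: {kind} to {member.linkname}"
    logger.error(message)
    return message
```

The point of the sketch is that `issym()` pairs with the symlink message and `islnk()` with the hard-link message; the bug described above is what happens when those two are crossed.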

Found while investigating:

resolving a weird tar extract issue #5441

CC: @lhoestq

HuggingFaceDocBuilderDev · 2023-01-23T07:58:01Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-01-23T08:40:54Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011848 / 0.011353 (0.000495)	0.006988 / 0.011008 (-0.004020)	0.138078 / 0.038508 (0.099570)	0.040310 / 0.023109 (0.017201)	0.411857 / 0.275898 (0.135959)	0.509496 / 0.323480 (0.186016)	0.010695 / 0.007986 (0.002709)	0.005275 / 0.004328 (0.000946)	0.107157 / 0.004250 (0.102907)	0.050987 / 0.037052 (0.013935)	0.432387 / 0.258489 (0.173898)	0.495136 / 0.293841 (0.201295)	0.055273 / 0.128546 (-0.073273)	0.019573 / 0.075646 (-0.056074)	0.460356 / 0.419271 (0.041084)	0.060916 / 0.043533 (0.017383)	0.426140 / 0.255139 (0.171002)	0.430461 / 0.283200 (0.147261)	0.124569 / 0.141683 (-0.017114)	1.989404 / 1.452155 (0.537250)	1.942052 / 1.492716 (0.449335)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.287233 / 0.018006 (0.269227)	0.606056 / 0.000490 (0.605566)	0.004435 / 0.000200 (0.004235)	0.000144 / 0.000054 (0.000090)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032353 / 0.037411 (-0.005058)	0.124237 / 0.014526 (0.109711)	0.143280 / 0.176557 (-0.033276)	0.182081 / 0.737135 (-0.555055)	0.148085 / 0.296338 (-0.148253)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.613550 / 0.215209 (0.398341)	6.172421 / 2.077655 (4.094766)	2.466018 / 1.504120 (0.961898)	2.166433 / 1.541195 (0.625238)	2.192511 / 1.468490 (0.724021)	1.248777 / 4.584777 (-3.336000)	5.746150 / 3.745712 (2.000438)	3.097184 / 5.269862 (-2.172678)	2.078176 / 4.565676 (-2.487501)	0.144351 / 0.424275 (-0.279924)	0.014830 / 0.007607 (0.007223)	0.761699 / 0.226044 (0.535655)	7.713201 / 2.268929 (5.444272)	3.359647 / 55.444624 (-52.084977)	2.652595 / 6.876477 (-4.223882)	2.721952 / 2.142072 (0.579880)	1.493036 / 4.805227 (-3.312192)	0.252336 / 6.500664 (-6.248328)	0.082906 / 0.075469 (0.007436)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.643887 / 1.841788 (-0.197901)	18.762775 / 8.074308 (10.688466)	22.003583 / 10.191392 (11.812191)	0.256361 / 0.680424 (-0.424062)	0.048048 / 0.534201 (-0.486153)	0.601971 / 0.579283 (0.022688)	0.712801 / 0.434364 (0.278438)	0.684473 / 0.540337 (0.144136)	0.802566 / 1.386936 (-0.584370)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010410 / 0.011353 (-0.000943)	0.006719 / 0.011008 (-0.004289)	0.132862 / 0.038508 (0.094354)	0.036973 / 0.023109 (0.013863)	0.470925 / 0.275898 (0.195027)	0.502864 / 0.323480 (0.179384)	0.007447 / 0.007986 (-0.000539)	0.005629 / 0.004328 (0.001301)	0.091985 / 0.004250 (0.087734)	0.057537 / 0.037052 (0.020485)	0.458362 / 0.258489 (0.199873)	0.518324 / 0.293841 (0.224483)	0.056540 / 0.128546 (-0.072007)	0.021266 / 0.075646 (-0.054380)	0.448289 / 0.419271 (0.029018)	0.064211 / 0.043533 (0.020678)	0.492596 / 0.255139 (0.237457)	0.495030 / 0.283200 (0.211830)	0.121858 / 0.141683 (-0.019825)	1.823821 / 1.452155 (0.371667)	2.012165 / 1.492716 (0.519449)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.296252 / 0.018006 (0.278245)	0.601688 / 0.000490 (0.601198)	0.006369 / 0.000200 (0.006169)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035821 / 0.037411 (-0.001590)	0.132722 / 0.014526 (0.118196)	0.141819 / 0.176557 (-0.034738)	0.205115 / 0.737135 (-0.532020)	0.148917 / 0.296338 (-0.147422)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.678207 / 0.215209 (0.462998)	6.969918 / 2.077655 (4.892263)	3.077831 / 1.504120 (1.573711)	2.689296 / 1.541195 (1.148102)	2.706462 / 1.468490 (1.237972)	1.249125 / 4.584777 (-3.335652)	5.793917 / 3.745712 (2.048205)	3.137565 / 5.269862 (-2.132297)	2.056880 / 4.565676 (-2.508796)	0.151918 / 0.424275 (-0.272357)	0.015029 / 0.007607 (0.007422)	0.833975 / 0.226044 (0.607930)	8.575649 / 2.268929 (6.306720)	3.812115 / 55.444624 (-51.632509)	3.124219 / 6.876477 (-3.752258)	3.178645 / 2.142072 (1.036572)	1.488260 / 4.805227 (-3.316967)	0.268239 / 6.500664 (-6.232425)	0.089463 / 0.075469 (0.013993)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.645461 / 1.841788 (-0.196327)	19.074412 / 8.074308 (11.000104)	21.626726 / 10.191392 (11.435334)	0.210525 / 0.680424 (-0.469899)	0.032166 / 0.534201 (-0.502035)	0.555572 / 0.579283 (-0.023711)	0.654667 / 0.434364 (0.220303)	0.632471 / 0.540337 (0.092133)	0.756510 / 1.386936 (-0.630426)

Swap log messages for symbolic/hard links in tar extractor

a457f9a

albertvillanova merged commit 6681c36 into huggingface:main Jan 23, 2023

albertvillanova deleted the fix-extract-safemembers-logs branch January 23, 2023 08:31

albertvillanova mentioned this pull request Jan 23, 2023

Fix base directory while extracting insecure TAR files #5453

Merged
