don't zero copy timestamps #5504

dwyatte · 2023-02-03T23:39:04Z

I'm not sure whether we prefer a test here or if timestamps are known to be unsupported (like booleans). The current test at least covers the bug.

HuggingFaceDocBuilderDev · 2023-02-05T14:09:49Z

The documentation is not available anymore as the PR was closed or merged.

albertvillanova

Thanks for the fix, @dwyatte.

mariosasko

Thanks! I modified the test a bit to make it more consistent with the rest of the "extractor" tests. It looks all good now.

PS: The CI failures are unrelated to the changes.

github-actions · 2023-02-08T14:40:13Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008606 / 0.011353 (-0.002747)	0.004659 / 0.011008 (-0.006349)	0.101311 / 0.038508 (0.062802)	0.029664 / 0.023109 (0.006555)	0.321850 / 0.275898 (0.045952)	0.380497 / 0.323480 (0.057017)	0.007003 / 0.007986 (-0.000982)	0.003393 / 0.004328 (-0.000936)	0.078704 / 0.004250 (0.074453)	0.035810 / 0.037052 (-0.001242)	0.327271 / 0.258489 (0.068782)	0.369302 / 0.293841 (0.075461)	0.033625 / 0.128546 (-0.094921)	0.011563 / 0.075646 (-0.064084)	0.323950 / 0.419271 (-0.095322)	0.040660 / 0.043533 (-0.002872)	0.327211 / 0.255139 (0.072072)	0.350325 / 0.283200 (0.067125)	0.085427 / 0.141683 (-0.056256)	1.464370 / 1.452155 (0.012216)	1.490355 / 1.492716 (-0.002362)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.202879 / 0.018006 (0.184873)	0.419836 / 0.000490 (0.419346)	0.000303 / 0.000200 (0.000103)	0.000063 / 0.000054 (0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023336 / 0.037411 (-0.014075)	0.096817 / 0.014526 (0.082291)	0.103990 / 0.176557 (-0.072567)	0.137749 / 0.737135 (-0.599386)	0.108236 / 0.296338 (-0.188102)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.420801 / 0.215209 (0.205592)	4.205308 / 2.077655 (2.127653)	2.050363 / 1.504120 (0.546243)	1.877390 / 1.541195 (0.336195)	2.031060 / 1.468490 (0.562570)	0.687950 / 4.584777 (-3.896827)	3.363202 / 3.745712 (-0.382510)	1.869482 / 5.269862 (-3.400379)	1.159131 / 4.565676 (-3.406545)	0.082374 / 0.424275 (-0.341901)	0.012425 / 0.007607 (0.004818)	0.519775 / 0.226044 (0.293731)	5.244612 / 2.268929 (2.975684)	2.371314 / 55.444624 (-53.073311)	2.052713 / 6.876477 (-4.823764)	2.190015 / 2.142072 (0.047942)	0.803806 / 4.805227 (-4.001421)	0.148110 / 6.500664 (-6.352554)	0.064174 / 0.075469 (-0.011295)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.250424 / 1.841788 (-0.591364)	13.487870 / 8.074308 (5.413561)	13.080736 / 10.191392 (2.889344)	0.147715 / 0.680424 (-0.532709)	0.028409 / 0.534201 (-0.505792)	0.397531 / 0.579283 (-0.181752)	0.399458 / 0.434364 (-0.034905)	0.461467 / 0.540337 (-0.078871)	0.541639 / 1.386936 (-0.845297)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006753 / 0.011353 (-0.004600)	0.004573 / 0.011008 (-0.006435)	0.076122 / 0.038508 (0.037614)	0.027529 / 0.023109 (0.004419)	0.341291 / 0.275898 (0.065393)	0.376889 / 0.323480 (0.053409)	0.005032 / 0.007986 (-0.002953)	0.003447 / 0.004328 (-0.000882)	0.075186 / 0.004250 (0.070936)	0.038516 / 0.037052 (0.001463)	0.340927 / 0.258489 (0.082438)	0.386626 / 0.293841 (0.092785)	0.031929 / 0.128546 (-0.096617)	0.011759 / 0.075646 (-0.063888)	0.085616 / 0.419271 (-0.333656)	0.042858 / 0.043533 (-0.000674)	0.341881 / 0.255139 (0.086742)	0.367502 / 0.283200 (0.084303)	0.090788 / 0.141683 (-0.050895)	1.472871 / 1.452155 (0.020716)	1.577825 / 1.492716 (0.085109)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.233137 / 0.018006 (0.215131)	0.415016 / 0.000490 (0.414526)	0.000379 / 0.000200 (0.000179)	0.000059 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024966 / 0.037411 (-0.012445)	0.102794 / 0.014526 (0.088268)	0.107543 / 0.176557 (-0.069014)	0.143133 / 0.737135 (-0.594002)	0.111494 / 0.296338 (-0.184845)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.438354 / 0.215209 (0.223145)	4.382244 / 2.077655 (2.304589)	2.056340 / 1.504120 (0.552220)	1.851524 / 1.541195 (0.310330)	1.933147 / 1.468490 (0.464657)	0.701446 / 4.584777 (-3.883331)	3.396893 / 3.745712 (-0.348819)	2.837516 / 5.269862 (-2.432346)	1.538298 / 4.565676 (-3.027379)	0.083449 / 0.424275 (-0.340826)	0.012793 / 0.007607 (0.005186)	0.539661 / 0.226044 (0.313616)	5.428415 / 2.268929 (3.159487)	2.527582 / 55.444624 (-52.917042)	2.172795 / 6.876477 (-4.703682)	2.220011 / 2.142072 (0.077938)	0.814338 / 4.805227 (-3.990889)	0.153468 / 6.500664 (-6.347196)	0.069056 / 0.075469 (-0.006413)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.278434 / 1.841788 (-0.563354)	14.284924 / 8.074308 (6.210616)	13.486596 / 10.191392 (3.295203)	0.138457 / 0.680424 (-0.541967)	0.016609 / 0.534201 (-0.517592)	0.382828 / 0.579283 (-0.196455)	0.387604 / 0.434364 (-0.046760)	0.478801 / 0.540337 (-0.061536)	0.565352 / 1.386936 (-0.821584)

dwyatte · 2023-02-08T17:28:50Z

> Thanks! I modified the test a bit to make it more consistent with the rest of the "extractor" tests.

Appreciate the assist on the tests! 🚀

* don't zero copy timestamps
* Improve comment and tests
* Fix test

---------

Co-authored-by: mariosasko <mariosasko777@gmail.com>

don't zero copy timestamps

177f96f

dwyatte mentioned this pull request Feb 3, 2023

to_tf_dataset fails with datetime UTC columns even if not included in columns argument #5495

Closed

albertvillanova approved these changes Feb 6, 2023

mariosasko added 2 commits February 7, 2023 17:36

Improve comment and tests

7204511

Fix test

4d64bbb

mariosasko approved these changes Feb 8, 2023

mariosasko merged commit c39ba50 into huggingface:main Feb 8, 2023

filip-halt pushed a commit to filip-halt/datasets that referenced this pull request Feb 16, 2023

don't zero copy timestamps (huggingface#5504)

8fec7db

* don't zero copy timestamps
* Improve comment and tests
* Fix test

---------

Co-authored-by: mariosasko <mariosasko777@gmail.com>

filip-halt pushed a commit to filip-halt/datasets that referenced this pull request Feb 16, 2023

don't zero copy timestamps (huggingface#5504)

79b4f91

* don't zero copy timestamps
* Improve comment and tests
* Fix test

---------

Co-authored-by: mariosasko <mariosasko777@gmail.com>
