fix: image array should support other formats than uint8 #5365

vigsterkr · 2022-12-15T13:17:50Z

Currently, images that are provided as ndarrays but not in uint8 format are going to lose data. For example, in a depth image where the data is in float32 format, the type-casting to uint8 will basically make the whole image blank.
PIL.Image.fromarray does support mode F.

That said, maybe some further metadata could be supplied via the Image object.
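
To make the data loss concrete, here is a minimal sketch, assuming a made-up 4x4 float32 "depth" array with values below 1.0 (the array and the variable names are illustrative and not taken from the PR). It compares the uint8 cast with passing the float32 array straight to `PIL.Image.fromarray`:

```python
import numpy as np
from PIL import Image

# Hypothetical depth map: float32 values in [0, 0.99].
depth = np.linspace(0.0, 0.99, num=16, dtype=np.float32).reshape(4, 4)

# Casting to uint8 truncates toward zero, so every value below 1.0 becomes 0.
as_uint8 = depth.astype(np.uint8)

# Passing the float32 array directly yields a mode "F" image that keeps the values.
img = Image.fromarray(depth)
restored = np.asarray(img)

print(as_uint8.max())                   # 0, the image is blank
print(img.mode)                         # F
print(np.array_equal(restored, depth))  # True
```

Depth values on a larger scale would not all collapse to zero, but they would still lose their fractional part and wrap or clip outside the 0-255 range.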

HuggingFaceDocBuilderDev · 2022-12-15T13:22:15Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko · 2022-12-23T17:14:23Z

Hi, thanks for working on this!

I agree that the current type-casting (always cast to np.uint8, as TensorFlow Datasets does) is a bit too harsh. However, not all dtypes are supported in Image.fromarray (e.g. np.int64), so we need to treat these with special care (e.g. downcast to the closest supported dtype, maybe with warnings to let the user know what's happening).

PS: To avoid the CI failures, we need to handle two more instances of the cast to np.uint8 (both are in the image.py file).

mariosasko · 2023-01-26T15:47:34Z

I've made some changes to the PR.

Now the encoding procedure behaves as follows:

- for multi-channel arrays: if their dtype is int/uint, cast to np.uint8 (the only supported dtype for multi-channel arrays); otherwise, throw an error
- if the array dtype is of a valid kind ("u", "i", "f", ...):
  - don't do anything if Pillow natively supports it
  - otherwise, downcast until it becomes compatible with Pillow
- raise an error if none of the above applies
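
The procedure above can be sketched roughly as follows. This is not the code merged in the PR: `pick_image_dtype` and `ASSUMED_PILLOW_DTYPES` are hypothetical names, the set of "natively supported" dtypes is an assumption made for illustration, and the sketch only considers native byte order.

```python
import warnings

import numpy as np

# Assumption: illustrative single-channel dtypes treated as Pillow-compatible.
# The authoritative list is defined by Pillow and the merged PR, not by this sketch.
ASSUMED_PILLOW_DTYPES = {
    np.dtype(name)
    for name in ("bool", "uint8", "uint16", "int16", "uint32", "int32", "float32", "float64")
}


def pick_image_dtype(array: np.ndarray) -> np.dtype:
    """Return the dtype the array should be cast to before encoding (hypothetical helper)."""
    dtype = array.dtype

    # Multi-channel arrays: only uint8 is accepted, and only int/uint inputs are cast.
    if array.ndim > 2:
        if dtype.kind not in ("u", "i"):
            raise TypeError(f"Unsupported dtype {dtype} for a multi-channel array")
        dest = np.dtype("uint8")
        if dtype != dest:
            warnings.warn(f"Downcasting {dtype} to {dest}")
        return dest

    # Natively supported: leave the dtype untouched.
    if dtype in ASSUMED_PILLOW_DTYPES:
        return dtype

    # Valid kind but unsupported size: halve the item size until a match is found.
    if dtype.kind in ("u", "i", "f"):
        itemsize = dtype.itemsize // 2
        while itemsize >= 1:
            try:
                candidate = np.dtype(f"{dtype.kind}{itemsize}")
            except TypeError:
                break  # NumPy has no dtype of this kind and size (e.g. "f1")
            if candidate in ASSUMED_PILLOW_DTYPES:
                warnings.warn(f"Downcasting {dtype} to {candidate}")
                return candidate
            itemsize //= 2

    raise TypeError(f"Cannot convert dtype {dtype} to a supported image dtype")
```

Under these assumptions, a 2D int64 array would be downcast to int32 with a warning, a 2D float32 array would be left as is, and a three-channel float32 array would raise a `TypeError`. The caller would then apply `array.astype(...)` with the returned dtype before handing the array to `PIL.Image.fromarray`.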

lhoestq

Looks all good :)

Can you also mention in the docs which precisions are supported and which ones are downcast?

It could go in https://huggingface.co/docs/datasets/about_dataset_features, for example (there is a paragraph on audio but none for images yet)

lhoestq

Just added some docs :) let me know if it sounds good to you @mariosasko and then we can merge IMO

mariosasko

Two nits regarding the docs

docs/source/about_dataset_features.mdx

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

github-actions · 2023-01-26T18:46:45Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009537 / 0.011353 (-0.001816)	0.004946 / 0.011008 (-0.006062)	0.100552 / 0.038508 (0.062043)	0.035119 / 0.023109 (0.012009)	0.295989 / 0.275898 (0.020091)	0.361326 / 0.323480 (0.037846)	0.007608 / 0.007986 (-0.000378)	0.004151 / 0.004328 (-0.000177)	0.077301 / 0.004250 (0.073050)	0.042921 / 0.037052 (0.005869)	0.304804 / 0.258489 (0.046315)	0.345934 / 0.293841 (0.052093)	0.038987 / 0.128546 (-0.089559)	0.012055 / 0.075646 (-0.063591)	0.334035 / 0.419271 (-0.085236)	0.052679 / 0.043533 (0.009146)	0.291700 / 0.255139 (0.036561)	0.335423 / 0.283200 (0.052223)	0.107002 / 0.141683 (-0.034680)	1.516780 / 1.452155 (0.064625)	1.514137 / 1.492716 (0.021420)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.014719 / 0.018006 (-0.003287)	0.545251 / 0.000490 (0.544761)	0.004719 / 0.000200 (0.004519)	0.000275 / 0.000054 (0.000220)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026633 / 0.037411 (-0.010779)	0.106911 / 0.014526 (0.092385)	0.120258 / 0.176557 (-0.056299)	0.156196 / 0.737135 (-0.580940)	0.123132 / 0.296338 (-0.173207)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.398018 / 0.215209 (0.182809)	3.973992 / 2.077655 (1.896337)	1.776436 / 1.504120 (0.272316)	1.579036 / 1.541195 (0.037841)	1.643345 / 1.468490 (0.174855)	0.692408 / 4.584777 (-3.892369)	3.757243 / 3.745712 (0.011531)	3.226212 / 5.269862 (-2.043649)	1.797845 / 4.565676 (-2.767831)	0.085878 / 0.424275 (-0.338398)	0.012451 / 0.007607 (0.004844)	0.509755 / 0.226044 (0.283711)	5.029035 / 2.268929 (2.760107)	2.255507 / 55.444624 (-53.189117)	1.892868 / 6.876477 (-4.983609)	1.900017 / 2.142072 (-0.242055)	0.853965 / 4.805227 (-3.951263)	0.167268 / 6.500664 (-6.333396)	0.062796 / 0.075469 (-0.012673)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.183361 / 1.841788 (-0.658427)	15.103797 / 8.074308 (7.029489)	14.112931 / 10.191392 (3.921539)	0.167234 / 0.680424 (-0.513190)	0.029487 / 0.534201 (-0.504713)	0.444121 / 0.579283 (-0.135162)	0.437821 / 0.434364 (0.003457)	0.544900 / 0.540337 (0.004562)	0.642142 / 1.386936 (-0.744794)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007078 / 0.011353 (-0.004275)	0.004983 / 0.011008 (-0.006026)	0.097106 / 0.038508 (0.058598)	0.033747 / 0.023109 (0.010637)	0.382030 / 0.275898 (0.106132)	0.410193 / 0.323480 (0.086713)	0.006658 / 0.007986 (-0.001327)	0.005358 / 0.004328 (0.001029)	0.073878 / 0.004250 (0.069628)	0.049292 / 0.037052 (0.012240)	0.384053 / 0.258489 (0.125564)	0.427826 / 0.293841 (0.133985)	0.036780 / 0.128546 (-0.091766)	0.012469 / 0.075646 (-0.063178)	0.332989 / 0.419271 (-0.086283)	0.059531 / 0.043533 (0.015998)	0.378431 / 0.255139 (0.123292)	0.402672 / 0.283200 (0.119473)	0.110782 / 0.141683 (-0.030901)	1.484570 / 1.452155 (0.032416)	1.608081 / 1.492716 (0.115365)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.232356 / 0.018006 (0.214350)	0.545648 / 0.000490 (0.545158)	0.003113 / 0.000200 (0.002913)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028138 / 0.037411 (-0.009273)	0.110786 / 0.014526 (0.096260)	0.123615 / 0.176557 (-0.052941)	0.165773 / 0.737135 (-0.571362)	0.126401 / 0.296338 (-0.169937)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.440518 / 0.215209 (0.225309)	4.393821 / 2.077655 (2.316166)	2.295479 / 1.504120 (0.791359)	2.116679 / 1.541195 (0.575485)	2.215561 / 1.468490 (0.747071)	0.722343 / 4.584777 (-3.862434)	3.783360 / 3.745712 (0.037647)	3.302242 / 5.269862 (-1.967620)	1.681535 / 4.565676 (-2.884142)	0.085738 / 0.424275 (-0.338537)	0.012373 / 0.007607 (0.004766)	0.540499 / 0.226044 (0.314455)	5.384915 / 2.268929 (3.115986)	2.766346 / 55.444624 (-52.678279)	2.451994 / 6.876477 (-4.424483)	2.505720 / 2.142072 (0.363647)	0.833006 / 4.805227 (-3.972221)	0.168206 / 6.500664 (-6.332458)	0.064971 / 0.075469 (-0.010498)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.253499 / 1.841788 (-0.588289)	15.381840 / 8.074308 (7.307532)	13.519493 / 10.191392 (3.328101)	0.165559 / 0.680424 (-0.514865)	0.017682 / 0.534201 (-0.516519)	0.422248 / 0.579283 (-0.157035)	0.422750 / 0.434364 (-0.011614)	0.524546 / 0.540337 (-0.015792)	0.626956 / 1.386936 (-0.759980)

fix: image array should support other formats than uint8

19647c7

Merge branch 'main' of github.com:huggingface/datasets into image_represenation

200ff82

mariosasko added 4 commits January 26, 2023 13:46

Improve image array type conversion

874a46f

Test encoding np arrays

6aa3fc8

Fix

c47a36a

Oopsie

59f87a3

mariosasko requested a review from lhoestq January 26, 2023 15:47

lhoestq reviewed Jan 26, 2023

awsaf49 mentioned this pull request Jan 26, 2023

Discrepancy in nyu_depth_v2 dataset #5461

Open

lhoestq added 2 commits January 26, 2023 17:55

docs

b604435

minor

343c452

lhoestq approved these changes Jan 26, 2023

mariosasko reviewed Jan 26, 2023

docs/source/about_dataset_features.mdx (outdated, resolved)

docs/source/about_dataset_features.mdx (outdated, resolved)

Apply suggestions from code review

b0db7ca

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

lhoestq merged commit d9a8d8a into huggingface:main Jan 26, 2023
