Remove Sequence-specific code
Rocketknight1 committed Sep 8, 2022
1 parent 31a6d58 commit c1b98ee
Showing 1 changed file with 11 additions and 18 deletions.
29 changes: 11 additions & 18 deletions src/datasets/arrow_dataset.py
@@ -299,24 +299,17 @@ def _get_output_signature(
                     f"Unrecognized array dtype {np_arrays[0].dtype}. \n"
                     "Nested types and image/audio types are not supported yet."
                 )
-            if (
-                column in dataset
-                and isinstance(dataset.features[column], Sequence)
-                and dataset.features[column].length != -1
-            ):
-                static_shape = [batch_size, dataset.features[column].length]
-            else:
-                shapes = [array.shape for array in np_arrays]
-                static_shape = []
-                for dim in range(len(shapes[0])):
-                    sizes = set([shape[dim] for shape in shapes])
-                    if dim == 0:
-                        static_shape.append(batch_size)
-                        continue
-                    if len(sizes) == 1:  # This dimension looks constant
-                        static_shape.append(sizes.pop())
-                    else:  # Use None for variable dimensions
-                        static_shape.append(None)
+            shapes = [array.shape for array in np_arrays]
+            static_shape = []
+            for dim in range(len(shapes[0])):
+                sizes = set([shape[dim] for shape in shapes])
+                if dim == 0:
+                    static_shape.append(batch_size)
+                    continue
+                if len(sizes) == 1:  # This dimension looks constant
+                    static_shape.append(sizes.pop())
+                else:  # Use None for variable dimensions
+                    static_shape.append(None)
             tf_columns_to_signatures[column] = tf.TensorSpec(shape=static_shape, dtype=tf_dtype)
             np_columns_to_dtypes[column] = np_dtype
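
After this change, the static shape appears to be inferred purely from the sampled arrays rather than from the `Sequence` feature's declared length. The following is a minimal standalone sketch of that inference loop, not the library's actual code: the helper name `infer_static_shape` is hypothetical, and it takes plain shape tuples (such as `array.shape`) so it runs without NumPy or TensorFlow.

```python
def infer_static_shape(shapes, batch_size=None):
    """Hypothetical helper mirroring the loop kept by this commit.

    `shapes` is a list of shape tuples, one per sampled batch
    (for example `[array.shape for array in np_arrays]`).
    """
    static_shape = []
    for dim in range(len(shapes[0])):
        # Collect every size observed for this dimension across the samples.
        sizes = {shape[dim] for shape in shapes}
        if dim == 0:
            # The leading dimension is always taken from batch_size.
            static_shape.append(batch_size)
        elif len(sizes) == 1:
            # Every sample agrees, so treat the dimension as constant.
            static_shape.append(sizes.pop())
        else:
            # Sizes differ between samples, so leave the dimension variable.
            static_shape.append(None)
    return static_shape


# Two sampled batches whose second dimension is always 128.
fixed_shape = infer_static_shape([(4, 128), (4, 128)], batch_size=4)
# Two sampled batches padded to different lengths.
ragged_shape = infer_static_shape([(4, 17), (4, 23)], batch_size=4)
print(fixed_shape)   # [4, 128]
print(ragged_shape)  # [4, None]
```

Under this sketch, a fixed-length column still gets a constant dimension as long as every sampled batch agrees on it, which may be why the `Sequence`-specific branch could be dropped.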


1 comment on commit c1b98ee

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008317 / 0.011353 (-0.003036) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003947 / 0.011008 (-0.007061) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.030857 / 0.038508 (-0.007652) |
| read_batch_unformated after write_array2d | 0.035597 / 0.023109 (0.012488) |
| read_batch_unformated after write_flattened_sequence | 0.294985 / 0.275898 (0.019087) |
| read_batch_unformated after write_nested_sequence | 0.359178 / 0.323480 (0.035698) |
| read_col_formatted_as_numpy after write_array2d | 0.005996 / 0.007986 (-0.001989) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003464 / 0.004328 (-0.000864) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006973 / 0.004250 (0.002722) |
| read_col_unformated after write_array2d | 0.049018 / 0.037052 (0.011965) |
| read_col_unformated after write_flattened_sequence | 0.308525 / 0.258489 (0.050036) |
| read_col_unformated after write_nested_sequence | 0.345970 / 0.293841 (0.052129) |
| read_formatted_as_numpy after write_array2d | 0.031661 / 0.128546 (-0.096885) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009681 / 0.075646 (-0.065965) |
| read_formatted_as_numpy after write_nested_sequence | 0.273930 / 0.419271 (-0.145342) |
| read_unformated after write_array2d | 0.052943 / 0.043533 (0.009410) |
| read_unformated after write_flattened_sequence | 0.294895 / 0.255139 (0.039756) |
| read_unformated after write_nested_sequence | 0.320682 / 0.283200 (0.037482) |
| write_array2d | 0.104289 / 0.141683 (-0.037394) |
| write_flattened_sequence | 1.516638 / 1.452155 (0.064484) |
| write_nested_sequence | 1.507585 / 1.492716 (0.014868) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.233626 / 0.018006 (0.215619) |
| get_batch_of_1024_rows | 0.479669 / 0.000490 (0.479179) |
| get_first_row | 0.002460 / 0.000200 (0.002260) |
| get_last_row | 0.000123 / 0.000054 (0.000069) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024219 / 0.037411 (-0.013192) |
| shard | 0.106479 / 0.014526 (0.091953) |
| shuffle | 0.120773 / 0.176557 (-0.055783) |
| sort | 0.167794 / 0.737135 (-0.569341) |
| train_test_split | 0.121027 / 0.296338 (-0.175311) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.396468 / 0.215209 (0.181259) |
| read 50000 | 3.941044 / 2.077655 (1.863389) |
| read_batch 50000 10 | 1.815331 / 1.504120 (0.311211) |
| read_batch 50000 100 | 1.628284 / 1.541195 (0.087089) |
| read_batch 50000 1000 | 1.703928 / 1.468490 (0.235438) |
| read_formatted numpy 5000 | 0.424470 / 4.584777 (-4.160307) |
| read_formatted pandas 5000 | 4.140897 / 3.745712 (0.395185) |
| read_formatted tensorflow 5000 | 2.069596 / 5.269862 (-3.200266) |
| read_formatted torch 5000 | 1.243988 / 4.565676 (-3.321689) |
| read_formatted_batch numpy 5000 10 | 0.051768 / 0.424275 (-0.372507) |
| read_formatted_batch numpy 5000 1000 | 0.011068 / 0.007607 (0.003461) |
| shuffled read 5000 | 0.498774 / 0.226044 (0.272730) |
| shuffled read 50000 | 4.981013 / 2.268929 (2.712085) |
| shuffled read_batch 50000 10 | 2.246379 / 55.444624 (-53.198245) |
| shuffled read_batch 50000 100 | 1.886572 / 6.876477 (-4.989905) |
| shuffled read_batch 50000 1000 | 2.055250 / 2.142072 (-0.086823) |
| shuffled read_formatted numpy 5000 | 0.539013 / 4.805227 (-4.266214) |
| shuffled read_formatted_batch numpy 5000 10 | 0.120243 / 6.500664 (-6.380421) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.060463 / 0.075469 (-0.015006) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.477209 / 1.841788 (-0.364579) |
| map fast-tokenizer batched | 14.197263 / 8.074308 (6.122955) |
| map identity | 25.824324 / 10.191392 (15.632932) |
| map identity batched | 0.886482 / 0.680424 (0.206058) |
| map no-op batched | 0.547878 / 0.534201 (0.013678) |
| map no-op batched numpy | 0.387530 / 0.579283 (-0.191753) |
| map no-op batched pandas | 0.456824 / 0.434364 (0.022460) |
| map no-op batched pytorch | 0.275973 / 0.540337 (-0.264364) |
| map no-op batched tensorflow | 0.283912 / 1.386936 (-1.103024) |

PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006326 / 0.011353 (-0.005027) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004038 / 0.011008 (-0.006970) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.027929 / 0.038508 (-0.010579) |
| read_batch_unformated after write_array2d | 0.034946 / 0.023109 (0.011837) |
| read_batch_unformated after write_flattened_sequence | 0.382988 / 0.275898 (0.107090) |
| read_batch_unformated after write_nested_sequence | 0.452607 / 0.323480 (0.129127) |
| read_col_formatted_as_numpy after write_array2d | 0.004259 / 0.007986 (-0.003727) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004795 / 0.004328 (0.000466) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.004947 / 0.004250 (0.000697) |
| read_col_unformated after write_array2d | 0.046094 / 0.037052 (0.009041) |
| read_col_unformated after write_flattened_sequence | 0.393427 / 0.258489 (0.134938) |
| read_col_unformated after write_nested_sequence | 0.436027 / 0.293841 (0.142186) |
| read_formatted_as_numpy after write_array2d | 0.030561 / 0.128546 (-0.097985) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009729 / 0.075646 (-0.065917) |
| read_formatted_as_numpy after write_nested_sequence | 0.257352 / 0.419271 (-0.161920) |
| read_unformated after write_array2d | 0.067855 / 0.043533 (0.024322) |
| read_unformated after write_flattened_sequence | 0.383285 / 0.255139 (0.128146) |
| read_unformated after write_nested_sequence | 0.402589 / 0.283200 (0.119389) |
| write_array2d | 0.112831 / 0.141683 (-0.028852) |
| write_flattened_sequence | 1.473492 / 1.452155 (0.021337) |
| write_nested_sequence | 1.491951 / 1.492716 (-0.000765) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.223024 / 0.018006 (0.205018) |
| get_batch_of_1024_rows | 0.444049 / 0.000490 (0.443559) |
| get_first_row | 0.003573 / 0.000200 (0.003373) |
| get_last_row | 0.000088 / 0.000054 (0.000033) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022945 / 0.037411 (-0.014466) |
| shard | 0.104748 / 0.014526 (0.090222) |
| shuffle | 0.115552 / 0.176557 (-0.061005) |
| sort | 0.158990 / 0.737135 (-0.578145) |
| train_test_split | 0.118497 / 0.296338 (-0.177841) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.420540 / 0.215209 (0.205331) |
| read 50000 | 4.175657 / 2.077655 (2.098002) |
| read_batch 50000 10 | 2.005529 / 1.504120 (0.501409) |
| read_batch 50000 100 | 1.812445 / 1.541195 (0.271250) |
| read_batch 50000 1000 | 1.934206 / 1.468490 (0.465716) |
| read_formatted numpy 5000 | 0.425607 / 4.584777 (-4.159170) |
| read_formatted pandas 5000 | 3.763531 / 3.745712 (0.017819) |
| read_formatted tensorflow 5000 | 2.000830 / 5.269862 (-3.269031) |
| read_formatted torch 5000 | 1.249015 / 4.565676 (-3.316662) |
| read_formatted_batch numpy 5000 10 | 0.051643 / 0.424275 (-0.372632) |
| read_formatted_batch numpy 5000 1000 | 0.011767 / 0.007607 (0.004160) |
| shuffled read 5000 | 0.519798 / 0.226044 (0.293754) |
| shuffled read 50000 | 5.195249 / 2.268929 (2.926320) |
| shuffled read_batch 50000 10 | 2.486748 / 55.444624 (-52.957876) |
| shuffled read_batch 50000 100 | 2.121577 / 6.876477 (-4.754900) |
| shuffled read_batch 50000 1000 | 2.308200 / 2.142072 (0.166128) |
| shuffled read_formatted numpy 5000 | 0.533295 / 4.805227 (-4.271932) |
| shuffled read_formatted_batch numpy 5000 10 | 0.120863 / 6.500664 (-6.379801) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.061376 / 0.075469 (-0.014093) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.557500 / 1.841788 (-0.284288) |
| map fast-tokenizer batched | 14.182423 / 8.074308 (6.108115) |
| map identity | 25.109841 / 10.191392 (14.918449) |
| map identity batched | 0.979421 / 0.680424 (0.298997) |
| map no-op batched | 0.615707 / 0.534201 (0.081506) |
| map no-op batched numpy | 0.390425 / 0.579283 (-0.188858) |
| map no-op batched pandas | 0.443574 / 0.434364 (0.009211) |
| map no-op batched pytorch | 0.279692 / 0.540337 (-0.260646) |
| map no-op batched tensorflow | 0.284374 / 1.386936 (-1.102562) |
