Add a test
mariosasko committed Nov 8, 2021
1 parent daaa0de commit e8f6fae
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions tests/test_arrow_writer.py
@@ -234,6 +234,18 @@ def test_optimized_typed_sequence(sequence, col, expected_dtype):
    assert get_base_dtype(arr.type) == expected_dtype


def test_arrow_writer_typed_sequence_cls(monkeypatch):
    stream = pa.BufferOutputStream()

    with ArrowWriter(stream=stream) as writer:
        assert writer.typed_sequence_cls == OptimizedTypedSequence

    monkeypatch.setattr(config, "OPTIMIZE_PYARROW_TYPES", False)

    with ArrowWriter(stream=stream) as writer:
        assert writer.typed_sequence_cls == TypedSequence


@pytest.mark.parametrize("raise_exception", [False, True])
def test_arrow_writer_closes_stream(raise_exception, tmp_path):
    path = str(tmp_path / "dataset-train.arrow")
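
The added test only makes sense if the writer picks its sequence class from the config flag when it is constructed, so that a second writer created after the flag is flipped sees the other class. Below is a minimal, self-contained sketch of that pattern. `Writer`, the two sequence classes, and the `config` namespace are hypothetical stand-ins written for illustration; they are not the `datasets` implementation, and reading the flag at construction time is an assumption.

```python
import types


# Hypothetical stand-ins; these are NOT the classes from the `datasets` library.
class TypedSequence:
    pass


class OptimizedTypedSequence(TypedSequence):
    pass


# Stand-in for a module-level config object that carries a feature flag.
config = types.SimpleNamespace(OPTIMIZE_PYARROW_TYPES=True)


class Writer:
    def __init__(self):
        # Assumption: the flag is read once, when the writer is created.
        self.typed_sequence_cls = (
            OptimizedTypedSequence if config.OPTIMIZE_PYARROW_TYPES else TypedSequence
        )


def select_classes():
    """Return the class chosen before and after the flag is turned off."""
    before = Writer().typed_sequence_cls
    original = config.OPTIMIZE_PYARROW_TYPES
    try:
        # Roughly what monkeypatch.setattr does, without the automatic restore.
        config.OPTIMIZE_PYARROW_TYPES = False
        after = Writer().typed_sequence_cls
    finally:
        # Restore the flag by hand, as monkeypatch would at test teardown.
        config.OPTIMIZE_PYARROW_TYPES = original
    return before, after
```

In a real test, `monkeypatch.setattr` is preferable to the manual `try`/`finally` because pytest undoes the change automatically when the test finishes.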

1 comment on commit e8f6fae

@github-actions
Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.082717 / 0.011353 (0.071364) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005166 / 0.011008 (-0.005842) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.038269 / 0.038508 (-0.000239) |
| read_batch_unformated after write_array2d | 0.040630 / 0.023109 (0.017521) |
| read_batch_unformated after write_flattened_sequence | 0.362944 / 0.275898 (0.087046) |
| read_batch_unformated after write_nested_sequence | 0.440839 / 0.323480 (0.117359) |
| read_col_formatted_as_numpy after write_array2d | 0.093518 / 0.007986 (0.085532) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005797 / 0.004328 (0.001469) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010212 / 0.004250 (0.005961) |
| read_col_unformated after write_array2d | 0.044186 / 0.037052 (0.007134) |
| read_col_unformated after write_flattened_sequence | 0.360959 / 0.258489 (0.102470) |
| read_col_unformated after write_nested_sequence | 0.421332 / 0.293841 (0.127491) |
| read_formatted_as_numpy after write_array2d | 0.109620 / 0.128546 (-0.018926) |
| read_formatted_as_numpy after write_flattened_sequence | 0.013016 / 0.075646 (-0.062630) |
| read_formatted_as_numpy after write_nested_sequence | 0.327188 / 0.419271 (-0.092083) |
| read_unformated after write_array2d | 0.059237 / 0.043533 (0.015704) |
| read_unformated after write_flattened_sequence | 0.364689 / 0.255139 (0.109550) |
| read_unformated after write_nested_sequence | 0.447614 / 0.283200 (0.164414) |
| write_array2d | 0.095827 / 0.141683 (-0.045856) |
| write_flattened_sequence | 1.973502 / 1.452155 (0.521347) |
| write_nested_sequence | 2.218877 / 1.492716 (0.726160) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.322657 / 0.018006 (0.304650) |
| get_batch_of_1024_rows | 0.546935 / 0.000490 (0.546446) |
| get_first_row | 0.029540 / 0.000200 (0.029340) |
| get_last_row | 0.000343 / 0.000054 (0.000288) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.040181 / 0.037411 (0.002770) |
| shard | 0.026386 / 0.014526 (0.011860) |
| shuffle | 0.031646 / 0.176557 (-0.144911) |
| sort | 0.220278 / 0.737135 (-0.516857) |
| train_test_split | 0.031869 / 0.296338 (-0.264469) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.624635 / 0.215209 (0.409425) |
| read 50000 | 6.168971 / 2.077655 (4.091316) |
| read_batch 50000 10 | 2.408094 / 1.504120 (0.903974) |
| read_batch 50000 100 | 2.086988 / 1.541195 (0.545793) |
| read_batch 50000 1000 | 2.222757 / 1.468490 (0.754267) |
| read_formatted numpy 5000 | 0.736480 / 4.584777 (-3.848297) |
| read_formatted pandas 5000 | 6.815529 / 3.745712 (3.069816) |
| read_formatted tensorflow 5000 | 5.017900 / 5.269862 (-0.251961) |
| read_formatted torch 5000 | 1.442083 / 4.565676 (-3.123594) |
| read_formatted_batch numpy 5000 10 | 0.083796 / 0.424275 (-0.340479) |
| read_formatted_batch numpy 5000 1000 | 0.013842 / 0.007607 (0.006235) |
| shuffled read 5000 | 0.835134 / 0.226044 (0.609089) |
| shuffled read 50000 | 7.890758 / 2.268929 (5.621829) |
| shuffled read_batch 50000 10 | 3.132469 / 55.444624 (-52.312156) |
| shuffled read_batch 50000 100 | 2.313127 / 6.876477 (-4.563350) |
| shuffled read_batch 50000 1000 | 2.345223 / 2.142072 (0.203150) |
| shuffled read_formatted numpy 5000 | 0.880739 / 4.805227 (-3.924488) |
| shuffled read_formatted_batch numpy 5000 10 | 0.177219 / 6.500664 (-6.323445) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066315 / 0.075469 (-0.009154) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.929129 / 1.841788 (0.087341) |
| map fast-tokenizer batched | 14.115064 / 8.074308 (6.040756) |
| map identity | 43.737557 / 10.191392 (33.546165) |
| map identity batched | 1.015672 / 0.680424 (0.335249) |
| map no-op batched | 0.696535 / 0.534201 (0.162334) |
| map no-op batched numpy | 0.471011 / 0.579283 (-0.108273) |
| map no-op batched pandas | 0.714294 / 0.434364 (0.279931) |
| map no-op batched pytorch | 0.325301 / 0.540337 (-0.215037) |
| map no-op batched tensorflow | 0.340193 / 1.386936 (-1.046743) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.075463 / 0.011353 (0.064110) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005967 / 0.011008 (-0.005042) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.036579 / 0.038508 (-0.001929) |
| read_batch_unformated after write_array2d | 0.034489 / 0.023109 (0.011380) |
| read_batch_unformated after write_flattened_sequence | 0.366991 / 0.275898 (0.091092) |
| read_batch_unformated after write_nested_sequence | 0.415170 / 0.323480 (0.091690) |
| read_col_formatted_as_numpy after write_array2d | 0.088527 / 0.007986 (0.080541) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006141 / 0.004328 (0.001813) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008020 / 0.004250 (0.003769) |
| read_col_unformated after write_array2d | 0.041514 / 0.037052 (0.004462) |
| read_col_unformated after write_flattened_sequence | 0.386651 / 0.258489 (0.128162) |
| read_col_unformated after write_nested_sequence | 0.425091 / 0.293841 (0.131251) |
| read_formatted_as_numpy after write_array2d | 0.097449 / 0.128546 (-0.031098) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015004 / 0.075646 (-0.060642) |
| read_formatted_as_numpy after write_nested_sequence | 0.317681 / 0.419271 (-0.101590) |
| read_unformated after write_array2d | 0.059560 / 0.043533 (0.016027) |
| read_unformated after write_flattened_sequence | 0.380405 / 0.255139 (0.125266) |
| read_unformated after write_nested_sequence | 0.414986 / 0.283200 (0.131786) |
| write_array2d | 0.084887 / 0.141683 (-0.056796) |
| write_flattened_sequence | 1.996854 / 1.452155 (0.544700) |
| write_nested_sequence | 1.992094 / 1.492716 (0.499377) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.236080 / 0.018006 (0.218074) |
| get_batch_of_1024_rows | 0.550803 / 0.000490 (0.550313) |
| get_first_row | 0.004041 / 0.000200 (0.003841) |
| get_last_row | 0.000826 / 0.000054 (0.000772) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034240 / 0.037411 (-0.003171) |
| shard | 0.024900 / 0.014526 (0.010374) |
| shuffle | 0.030786 / 0.176557 (-0.145771) |
| sort | 0.223504 / 0.737135 (-0.513631) |
| train_test_split | 0.032109 / 0.296338 (-0.264229) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.623061 / 0.215209 (0.407852) |
| read 50000 | 6.223844 / 2.077655 (4.146190) |
| read_batch 50000 10 | 2.409574 / 1.504120 (0.905454) |
| read_batch 50000 100 | 2.000418 / 1.541195 (0.459223) |
| read_batch 50000 1000 | 2.089537 / 1.468490 (0.621047) |
| read_formatted numpy 5000 | 0.763252 / 4.584777 (-3.821525) |
| read_formatted pandas 5000 | 6.796419 / 3.745712 (3.050706) |
| read_formatted tensorflow 5000 | 3.223782 / 5.269862 (-2.046079) |
| read_formatted torch 5000 | 1.544801 / 4.565676 (-3.020876) |
| read_formatted_batch numpy 5000 10 | 0.089373 / 0.424275 (-0.334902) |
| read_formatted_batch numpy 5000 1000 | 0.012675 / 0.007607 (0.005067) |
| shuffled read 5000 | 0.839269 / 0.226044 (0.613225) |
| shuffled read 50000 | 7.944697 / 2.268929 (5.675768) |
| shuffled read_batch 50000 10 | 3.095585 / 55.444624 (-52.349039) |
| shuffled read_batch 50000 100 | 2.315163 / 6.876477 (-4.561314) |
| shuffled read_batch 50000 1000 | 2.475727 / 2.142072 (0.333655) |
| shuffled read_formatted numpy 5000 | 0.920068 / 4.805227 (-3.885159) |
| shuffled read_formatted_batch numpy 5000 10 | 0.183166 / 6.500664 (-6.317498) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.069227 / 0.075469 (-0.006242) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.872578 / 1.841788 (0.030791) |
| map fast-tokenizer batched | 14.116453 / 8.074308 (6.042145) |
| map identity | 41.931230 / 10.191392 (31.739838) |
| map identity batched | 0.988448 / 0.680424 (0.308024) |
| map no-op batched | 0.653851 / 0.534201 (0.119650) |
| map no-op batched numpy | 0.454635 / 0.579283 (-0.124648) |
| map no-op batched pandas | 0.721999 / 0.434364 (0.287635) |
| map no-op batched pytorch | 0.346703 / 0.540337 (-0.193634) |
| map no-op batched tensorflow | 0.342415 / 1.386936 (-1.044521) |
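
Every result cell in the benchmarks above has the form `new / old (diff)`, and the numbers are consistent with `diff = new - old`. The last printed digit is sometimes off by one, presumably because the values were rounded after the subtraction (an assumption; the comment does not say). A small sketch for reading such a row and checking that relation follows. The helper names `parse_cells` and `is_consistent` are made up for this example, and the sample row is copied from the first `benchmark_getitem_100B.json` results.

```python
import re

# One cell looks like "0.322657 / 0.018006 (0.304650)"; the diff may be negative.
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row):
    """Return (new, old, diff) float triples for every cell found in a results row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


def is_consistent(cell, tol=2e-6):
    """Check diff == new - old, allowing for rounding in the last printed digit."""
    new, old, diff = cell
    return abs((new - old) - diff) <= tol


# Sample row copied from the PyArrow==3.0.0 benchmark_getitem_100B.json results.
row = "new / old (diff) 0.322657 / 0.018006 (0.304650) 0.546935 / 0.000490 (0.546446)"
cells = parse_cells(row)
```

A positive diff means the new timing is larger than the old reference value, and a negative diff means it is smaller.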
