
Commit

fix class_encode_column issue
lhoestq committed Apr 27, 2021
1 parent d93bc76 commit 88676c9
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions src/datasets/arrow_dataset.py
@@ -775,6 +775,7 @@ def class_encode_column(self, column: str) -> "Dataset":
         class_names = sorted(dset.unique(column))
         dst_feat = ClassLabel(names=class_names)
         dset = dset.map(lambda batch: {column: dst_feat.str2int(batch)}, input_columns=column, batched=True)
+        dset = concatenate_datasets([self.remove_columns([column]), dset], axis=1)

         new_features = copy.deepcopy(dset.features)
         new_features[column] = dst_feat
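The added line drops the original string column and concatenates the integer-encoded replacement column-wise onto the remaining columns. The encoding step itself can be sketched in plain Python, with a hypothetical `class_encode` helper standing in for `unique`/`ClassLabel.str2int`/`map` (no `datasets` dependency):

```python
# Hypothetical sketch of the label-encoding step: sorted unique values become
# the class names, and each string label maps to its index in that list.
def class_encode(rows, column):
    class_names = sorted({row[column] for row in rows})
    str2int = {name: i for i, name in enumerate(class_names)}
    # Build new rows with the column replaced by its integer code,
    # leaving the input rows untouched.
    encoded = [dict(row, **{column: str2int[row[column]]}) for row in rows]
    return encoded, class_names

rows = [{"text": "a", "label": "pos"}, {"text": "b", "label": "neg"}]
encoded, names = class_encode(rows, "label")
```

Sorting the unique values first makes the name-to-integer mapping deterministic, which is why the real implementation also calls `sorted(dset.unique(column))`.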
2 changes: 1 addition & 1 deletion tests/test_arrow_dataset.py
@@ -1975,7 +1975,7 @@ def test_dataset_add_item(item, in_memory, dataset_dict, arrow_path, transform):
     dataset = dataset_to_test.add_item(item)
     assert dataset.data.shape == (5, 3)
     expected_features = dataset_to_test.features
-    assert dataset.data.column_names == list(expected_features.keys())
+    assert sorted(dataset.data.column_names) == sorted(expected_features.keys())
     for feature, expected_dtype in expected_features.items():
         assert dataset.features[feature] == expected_dtype
     assert len(dataset.data.blocks) == 1 if in_memory else 2  # multiple InMemoryTables are consolidated as one
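The test now compares sorted column names because removing a column and re-attaching its encoded version via column-wise concatenation can change the column order, even though the set of columns is unchanged. A minimal illustration (hypothetical column names):

```python
# Removing a column and concatenating its re-encoded version last changes
# the order of the column list but not its contents.
original = ["label", "text"]                     # original column order
without = [c for c in original if c != "label"]  # after removing "label"
after = without + ["label"]                      # encoded column appended last
```

Comparing `sorted(...)` on both sides keeps the assertion order-insensitive.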

1 comment on commit 88676c9

@github-actions


PyArrow==1.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.025936 / 0.011353 (0.014583) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.018165 / 0.011008 (0.007156) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.059747 / 0.038508 (0.021239) |
| read_batch_unformated after write_array2d | 0.042048 / 0.023109 (0.018939) |
| read_batch_unformated after write_flattened_sequence | 0.379020 / 0.275898 (0.103122) |
| read_batch_unformated after write_nested_sequence | 0.415082 / 0.323480 (0.091603) |
| read_col_formatted_as_numpy after write_array2d | 0.012189 / 0.007986 (0.004203) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005429 / 0.004328 (0.001100) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012348 / 0.004250 (0.008097) |
| read_col_unformated after write_array2d | 0.054342 / 0.037052 (0.017290) |
| read_col_unformated after write_flattened_sequence | 0.374849 / 0.258489 (0.116360) |
| read_col_unformated after write_nested_sequence | 0.425765 / 0.293841 (0.131924) |
| read_formatted_as_numpy after write_array2d | 0.184024 / 0.128546 (0.055478) |
| read_formatted_as_numpy after write_flattened_sequence | 0.152866 / 0.075646 (0.077220) |
| read_formatted_as_numpy after write_nested_sequence | 0.511101 / 0.419271 (0.091829) |
| read_unformated after write_array2d | 0.487830 / 0.043533 (0.444297) |
| read_unformated after write_flattened_sequence | 0.384980 / 0.255139 (0.129841) |
| read_unformated after write_nested_sequence | 0.427988 / 0.283200 (0.144788) |
| write_array2d | 1.866287 / 0.141683 (1.724605) |
| write_flattened_sequence | 2.066700 / 1.452155 (0.614545) |
| write_nested_sequence | 2.114341 / 1.492716 (0.621624) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.007329 / 0.018006 (-0.010677) |
| get_batch_of_1024_rows | 0.000537 / 0.000490 (0.000047) |
| get_first_row | 0.000243 / 0.000200 (0.000043) |
| get_last_row | 0.000066 / 0.000054 (0.000011) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.050458 / 0.037411 (0.013047) |
| shard | 0.034077 / 0.014526 (0.019551) |
| shuffle | 0.032213 / 0.176557 (-0.144343) |
| sort | 0.052693 / 0.737135 (-0.684442) |
| train_test_split | 0.034040 / 0.296338 (-0.262298) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.505129 / 0.215209 (0.289920) |
| read 50000 | 5.200320 / 2.077655 (3.122665) |
| read_batch 50000 10 | 2.522077 / 1.504120 (1.017957) |
| read_batch 50000 100 | 2.101996 / 1.541195 (0.560801) |
| read_batch 50000 1000 | 2.221243 / 1.468490 (0.752753) |
| read_formatted numpy 5000 | 8.017117 / 4.584777 (3.432340) |
| read_formatted pandas 5000 | 7.168917 / 3.745712 (3.423205) |
| read_formatted tensorflow 5000 | 9.874169 / 5.269862 (4.604308) |
| read_formatted torch 5000 | 8.652422 / 4.565676 (4.086745) |
| read_formatted_batch numpy 5000 10 | 0.772978 / 0.424275 (0.348703) |
| read_formatted_batch numpy 5000 1000 | 0.012183 / 0.007607 (0.004576) |
| shuffled read 5000 | 0.641325 / 0.226044 (0.415281) |
| shuffled read 50000 | 6.418349 / 2.268929 (4.149421) |
| shuffled read_batch 50000 10 | 3.101300 / 55.444624 (-52.343325) |
| shuffled read_batch 50000 100 | 2.505235 / 6.876477 (-4.371242) |
| shuffled read_batch 50000 1000 | 2.507052 / 2.142072 (0.364980) |
| shuffled read_formatted numpy 5000 | 7.977510 / 4.805227 (3.172282) |
| shuffled read_formatted_batch numpy 5000 10 | 6.544633 / 6.500664 (0.043968) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.744081 / 0.075469 (8.668612) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.989074 / 1.841788 (11.147286) |
| map fast-tokenizer batched | 14.633838 / 8.074308 (6.559530) |
| map identity | 40.590221 / 10.191392 (30.398829) |
| map identity batched | 0.983014 / 0.680424 (0.302590) |
| map no-op batched | 0.729347 / 0.534201 (0.195146) |
| map no-op batched numpy | 0.905265 / 0.579283 (0.325982) |
| map no-op batched pandas | 0.715554 / 0.434364 (0.281190) |
| map no-op batched pytorch | 0.838742 / 0.540337 (0.298405) |
| map no-op batched tensorflow | 1.793589 / 1.386936 (0.406653) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.026886 / 0.011353 (0.015533) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017561 / 0.011008 (0.006553) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.058976 / 0.038508 (0.020468) |
| read_batch_unformated after write_array2d | 0.041224 / 0.023109 (0.018115) |
| read_batch_unformated after write_flattened_sequence | 0.396597 / 0.275898 (0.120699) |
| read_batch_unformated after write_nested_sequence | 0.461404 / 0.323480 (0.137924) |
| read_col_formatted_as_numpy after write_array2d | 0.012931 / 0.007986 (0.004945) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005445 / 0.004328 (0.001117) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.013155 / 0.004250 (0.008905) |
| read_col_unformated after write_array2d | 0.060190 / 0.037052 (0.023137) |
| read_col_unformated after write_flattened_sequence | 0.400623 / 0.258489 (0.142134) |
| read_col_unformated after write_nested_sequence | 0.442945 / 0.293841 (0.149104) |
| read_formatted_as_numpy after write_array2d | 0.187878 / 0.128546 (0.059331) |
| read_formatted_as_numpy after write_flattened_sequence | 0.145106 / 0.075646 (0.069460) |
| read_formatted_as_numpy after write_nested_sequence | 0.499398 / 0.419271 (0.080126) |
| read_unformated after write_array2d | 0.474107 / 0.043533 (0.430574) |
| read_unformated after write_flattened_sequence | 0.388590 / 0.255139 (0.133451) |
| read_unformated after write_nested_sequence | 0.441590 / 0.283200 (0.158391) |
| write_array2d | 1.903157 / 0.141683 (1.761474) |
| write_flattened_sequence | 2.042247 / 1.452155 (0.590093) |
| write_nested_sequence | 2.145516 / 1.492716 (0.652800) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.006759 / 0.018006 (-0.011247) |
| get_batch_of_1024_rows | 0.000493 / 0.000490 (0.000003) |
| get_first_row | 0.000195 / 0.000200 (-0.000005) |
| get_last_row | 0.000058 / 0.000054 (0.000003) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.044554 / 0.037411 (0.007143) |
| shard | 0.025192 / 0.014526 (0.010666) |
| shuffle | 0.030737 / 0.176557 (-0.145820) |
| sort | 0.052159 / 0.737135 (-0.684977) |
| train_test_split | 0.031519 / 0.296338 (-0.264819) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.542999 / 0.215209 (0.327790) |
| read 50000 | 5.326972 / 2.077655 (3.249317) |
| read_batch 50000 10 | 2.566535 / 1.504120 (1.062415) |
| read_batch 50000 100 | 2.261966 / 1.541195 (0.720771) |
| read_batch 50000 1000 | 2.281346 / 1.468490 (0.812856) |
| read_formatted numpy 5000 | 7.838847 / 4.584777 (3.254070) |
| read_formatted pandas 5000 | 6.934344 / 3.745712 (3.188632) |
| read_formatted tensorflow 5000 | 9.613210 / 5.269862 (4.343349) |
| read_formatted torch 5000 | 8.542079 / 4.565676 (3.976402) |
| read_formatted_batch numpy 5000 10 | 0.762304 / 0.424275 (0.338029) |
| read_formatted_batch numpy 5000 1000 | 0.011782 / 0.007607 (0.004175) |
| shuffled read 5000 | 0.645367 / 0.226044 (0.419322) |
| shuffled read 50000 | 6.681812 / 2.268929 (4.412884) |
| shuffled read_batch 50000 10 | 3.175415 / 55.444624 (-52.269209) |
| shuffled read_batch 50000 100 | 2.586550 / 6.876477 (-4.289926) |
| shuffled read_batch 50000 1000 | 2.644438 / 2.142072 (0.502366) |
| shuffled read_formatted numpy 5000 | 7.940694 / 4.805227 (3.135467) |
| shuffled read_formatted_batch numpy 5000 10 | 7.568533 / 6.500664 (1.067869) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.763540 / 0.075469 (8.688071) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.800200 / 1.841788 (10.958412) |
| map fast-tokenizer batched | 14.138745 / 8.074308 (6.064436) |
| map identity | 41.179796 / 10.191392 (30.988404) |
| map identity batched | 0.930345 / 0.680424 (0.249921) |
| map no-op batched | 0.692000 / 0.534201 (0.157799) |
| map no-op batched numpy | 0.877102 / 0.579283 (0.297818) |
| map no-op batched pandas | 0.700112 / 0.434364 (0.265749) |
| map no-op batched pytorch | 0.808829 / 0.540337 (0.268491) |
| map no-op batched tensorflow | 1.732548 / 1.386936 (0.345612) |

