asserts replaced by exception for text classification task with test. (#3256)

* asserts replaced by exception for text classification task with test.

* Update tests/test_tasks.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
manisnesan and lhoestq authored Nov 12, 2021
1 parent 78ca9b8 commit bf2d230
Showing 2 changed files with 9 additions and 1 deletion.
3 changes: 2 additions & 1 deletion src/datasets/tasks/text_classification.py
@@ -18,7 +18,8 @@ class TextClassification(TaskTemplate):

     def __post_init__(self):
         if self.labels:
-            assert len(self.labels) == len(set(self.labels)), "Labels must be unique"
+            if len(self.labels) != len(set(self.labels)):
+                raise ValueError("Labels must be unique")
             # Cast labels to tuple to allow hashing
             self.__dict__["labels"] = tuple(sorted(self.labels))
             self.__dict__["label_schema"] = self.label_schema.copy()
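
The commit does not state its motivation, but a common reason for this kind of change is that `assert` statements are skipped when Python runs with `-O`, while an explicit `raise` always executes. A minimal sketch of the same validation pattern follows; `LabelTemplate` is a hypothetical stand-in for illustration, not the actual `TextClassification` class.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LabelTemplate:  # hypothetical stand-in, not the datasets class
    labels: Optional[Sequence[str]] = None

    def __post_init__(self):
        if self.labels:
            # An explicit check still runs under `python -O`, unlike `assert`
            if len(self.labels) != len(set(self.labels)):
                raise ValueError("Labels must be unique")
            # A frozen dataclass blocks normal attribute assignment,
            # so the sorted tuple is written through __dict__ instead
            self.__dict__["labels"] = tuple(sorted(self.labels))


def duplicate_labels_rejected() -> bool:
    """Return True if constructing with duplicate labels raises ValueError."""
    try:
        LabelTemplate(labels=["pos", "neg", "pos"])
    except ValueError:
        return True
    return False
```

Callers that previously caught `AssertionError` around such a constructor would need to catch `ValueError` instead.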
7 changes: 7 additions & 0 deletions tests/test_tasks.py
@@ -40,6 +40,13 @@ def test_from_dict(self):
         self.assertEqual(input_schema, task.input_schema)
         self.assertEqual(label_schema, task.label_schema)

+    def test_value_error_unique_labels(self):
+        with self.assertRaises(ValueError):
+            # Add duplicate labels
+            labels = self.labels + self.labels[:1]
+            task = TextClassification(text_column="input_text", label_column="input_label", labels=labels)
+            self.assertEqual("text-classification", task.task)
+

 class QuestionAnsweringTest(TestCase):
     def test_column_mapping(self):
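
When a call inside an `assertRaises` block raises, any statements after it in the same block are not executed. The sketch below shows one way to test for a `ValueError` on duplicate labels while keeping follow-up checks outside the block. `UniqueLabels`, `UniqueLabelsTest`, and `run_sketch_tests` are hypothetical names used only for illustration and are not part of the commit.

```python
import unittest
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UniqueLabels:  # hypothetical stand-in for the real task template
    labels: Optional[Sequence[str]] = None

    def __post_init__(self):
        if self.labels and len(self.labels) != len(set(self.labels)):
            raise ValueError("Labels must be unique")


class UniqueLabelsTest(unittest.TestCase):
    def test_duplicates_raise(self):
        labels = ["pos", "neg"]
        with self.assertRaises(ValueError) as ctx:
            # Appending the first label again creates a duplicate
            UniqueLabels(labels=labels + labels[:1])
        # Checks placed after the with block run once the exception is caught
        self.assertIn("unique", str(ctx.exception))

    def test_unique_labels_pass(self):
        task = UniqueLabels(labels=["pos", "neg"])
        self.assertEqual(["pos", "neg"], list(task.labels))


def run_sketch_tests() -> unittest.TestResult:
    """Run the sketch test case programmatically and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UniqueLabelsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```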

1 comment on commit bf2d230

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.072344 / 0.011353 (0.060991)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.005110 / 0.011008 (-0.005898)
read_batch_formatted_as_numpy after write_nested_sequence: 0.036545 / 0.038508 (-0.001963)
read_batch_unformated after write_array2d: 0.039505 / 0.023109 (0.016396)
read_batch_unformated after write_flattened_sequence: 0.357026 / 0.275898 (0.081128)
read_batch_unformated after write_nested_sequence: 0.394108 / 0.323480 (0.070628)
read_col_formatted_as_numpy after write_array2d: 0.081556 / 0.007986 (0.073570)
read_col_formatted_as_numpy after write_flattened_sequence: 0.006062 / 0.004328 (0.001734)
read_col_formatted_as_numpy after write_nested_sequence: 0.010623 / 0.004250 (0.006372)
read_col_unformated after write_array2d: 0.040816 / 0.037052 (0.003764)
read_col_unformated after write_flattened_sequence: 0.366902 / 0.258489 (0.108413)
read_col_unformated after write_nested_sequence: 0.429376 / 0.293841 (0.135535)
read_formatted_as_numpy after write_array2d: 0.098601 / 0.128546 (-0.029945)
read_formatted_as_numpy after write_flattened_sequence: 0.013866 / 0.075646 (-0.061781)
read_formatted_as_numpy after write_nested_sequence: 0.331035 / 0.419271 (-0.088237)
read_unformated after write_array2d: 0.055084 / 0.043533 (0.011551)
read_unformated after write_flattened_sequence: 0.373330 / 0.255139 (0.118191)
read_unformated after write_nested_sequence: 0.405412 / 0.283200 (0.122212)
write_array2d: 0.092698 / 0.141683 (-0.048985)
write_flattened_sequence: 2.042263 / 1.452155 (0.590108)
write_nested_sequence: 2.040364 / 1.492716 (0.547648)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.290982 / 0.018006 (0.272976)
get_batch_of_1024_rows: 0.535734 / 0.000490 (0.535244)
get_first_row: 0.006714 / 0.000200 (0.006515)
get_last_row: 0.000115 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.041913 / 0.037411 (0.004501)
shard: 0.028642 / 0.014526 (0.014116)
shuffle: 0.041724 / 0.176557 (-0.134833)
sort: 0.232667 / 0.737135 (-0.504468)
train_test_split: 0.034343 / 0.296338 (-0.261996)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.612353 / 0.215209 (0.397144)
read 50000: 6.087262 / 2.077655 (4.009607)
read_batch 50000 10: 2.283062 / 1.504120 (0.778942)
read_batch 50000 100: 1.910662 / 1.541195 (0.369467)
read_batch 50000 1000: 1.977800 / 1.468490 (0.509310)
read_formatted numpy 5000: 0.725246 / 4.584777 (-3.859531)
read_formatted pandas 5000: 6.787591 / 3.745712 (3.041878)
read_formatted tensorflow 5000: 2.954241 / 5.269862 (-2.315620)
read_formatted torch 5000: 1.418504 / 4.565676 (-3.147173)
read_formatted_batch numpy 5000 10: 0.080112 / 0.424275 (-0.344163)
read_formatted_batch numpy 5000 1000: 0.012837 / 0.007607 (0.005230)
shuffled read 5000: 0.769871 / 0.226044 (0.543827)
shuffled read 50000: 7.560292 / 2.268929 (5.291363)
shuffled read_batch 50000 10: 3.080816 / 55.444624 (-52.363808)
shuffled read_batch 50000 100: 2.413700 / 6.876477 (-4.462777)
shuffled read_batch 50000 1000: 2.443396 / 2.142072 (0.301324)
shuffled read_formatted numpy 5000: 0.890021 / 4.805227 (-3.915206)
shuffled read_formatted_batch numpy 5000 10: 0.184138 / 6.500664 (-6.316526)
shuffled read_formatted_batch numpy 5000 1000: 0.065528 / 0.075469 (-0.009941)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.889707 / 1.841788 (0.047919)
map fast-tokenizer batched: 14.116234 / 8.074308 (6.041925)
map identity: 43.314573 / 10.191392 (33.123181)
map identity batched: 0.942532 / 0.680424 (0.262108)
map no-op batched: 0.621652 / 0.534201 (0.087451)
map no-op batched numpy: 0.458600 / 0.579283 (-0.120683)
map no-op batched pandas: 0.712019 / 0.434364 (0.277655)
map no-op batched pytorch: 0.325240 / 0.540337 (-0.215097)
map no-op batched tensorflow: 0.340117 / 1.386936 (-1.046819)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.072859 / 0.011353 (0.061506)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.004979 / 0.011008 (-0.006029)
read_batch_formatted_as_numpy after write_nested_sequence: 0.035932 / 0.038508 (-0.002576)
read_batch_unformated after write_array2d: 0.035843 / 0.023109 (0.012734)
read_batch_unformated after write_flattened_sequence: 0.365172 / 0.275898 (0.089274)
read_batch_unformated after write_nested_sequence: 0.394862 / 0.323480 (0.071382)
read_col_formatted_as_numpy after write_array2d: 0.089488 / 0.007986 (0.081502)
read_col_formatted_as_numpy after write_flattened_sequence: 0.004736 / 0.004328 (0.000407)
read_col_formatted_as_numpy after write_nested_sequence: 0.008927 / 0.004250 (0.004677)
read_col_unformated after write_array2d: 0.037651 / 0.037052 (0.000599)
read_col_unformated after write_flattened_sequence: 0.361591 / 0.258489 (0.103102)
read_col_unformated after write_nested_sequence: 0.387355 / 0.293841 (0.093514)
read_formatted_as_numpy after write_array2d: 0.104315 / 0.128546 (-0.024231)
read_formatted_as_numpy after write_flattened_sequence: 0.014978 / 0.075646 (-0.060668)
read_formatted_as_numpy after write_nested_sequence: 0.325711 / 0.419271 (-0.093560)
read_unformated after write_array2d: 0.065939 / 0.043533 (0.022406)
read_unformated after write_flattened_sequence: 0.355346 / 0.255139 (0.100207)
read_unformated after write_nested_sequence: 0.396915 / 0.283200 (0.113716)
write_array2d: 0.099329 / 0.141683 (-0.042354)
write_flattened_sequence: 1.904124 / 1.452155 (0.451969)
write_nested_sequence: 2.042748 / 1.492716 (0.550032)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.227073 / 0.018006 (0.209067)
get_batch_of_1024_rows: 0.552065 / 0.000490 (0.551576)
get_first_row: 0.004819 / 0.000200 (0.004619)
get_last_row: 0.000344 / 0.000054 (0.000289)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.038645 / 0.037411 (0.001234)
shard: 0.028795 / 0.014526 (0.014269)
shuffle: 0.032731 / 0.176557 (-0.143826)
sort: 0.241887 / 0.737135 (-0.495248)
train_test_split: 0.037326 / 0.296338 (-0.259012)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.612164 / 0.215209 (0.396954)
read 50000: 6.359632 / 2.077655 (4.281977)
read_batch 50000 10: 2.442718 / 1.504120 (0.938599)
read_batch 50000 100: 1.981236 / 1.541195 (0.440041)
read_batch 50000 1000: 1.994983 / 1.468490 (0.526492)
read_formatted numpy 5000: 0.712858 / 4.584777 (-3.871919)
read_formatted pandas 5000: 6.722958 / 3.745712 (2.977246)
read_formatted tensorflow 5000: 5.145879 / 5.269862 (-0.123982)
read_formatted torch 5000: 1.499465 / 4.565676 (-3.066211)
read_formatted_batch numpy 5000 10: 0.082842 / 0.424275 (-0.341433)
read_formatted_batch numpy 5000 1000: 0.013808 / 0.007607 (0.006201)
shuffled read 5000: 0.827348 / 0.226044 (0.601304)
shuffled read 50000: 7.782450 / 2.268929 (5.513522)
shuffled read_batch 50000 10: 3.072914 / 55.444624 (-52.371710)
shuffled read_batch 50000 100: 2.403778 / 6.876477 (-4.472698)
shuffled read_batch 50000 1000: 2.437791 / 2.142072 (0.295719)
shuffled read_formatted numpy 5000: 0.885990 / 4.805227 (-3.919238)
shuffled read_formatted_batch numpy 5000 10: 0.180909 / 6.500664 (-6.319756)
shuffled read_formatted_batch numpy 5000 1000: 0.069935 / 0.075469 (-0.005534)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.931713 / 1.841788 (0.089926)
map fast-tokenizer batched: 14.340735 / 8.074308 (6.266427)
map identity: 43.461833 / 10.191392 (33.270441)
map identity batched: 1.014007 / 0.680424 (0.333583)
map no-op batched: 0.731321 / 0.534201 (0.197120)
map no-op batched numpy: 0.516852 / 0.579283 (-0.062431)
map no-op batched pandas: 0.748834 / 0.434364 (0.314470)
map no-op batched pytorch: 0.355444 / 0.540337 (-0.184893)
map no-op batched tensorflow: 0.408700 / 1.386936 (-0.978237)
