Fixes to `to_tf_dataset` #3085

Rocketknight1 · 2021-10-14T14:25:56Z

No description provided.

lhoestq · 2021-10-18T10:15:25Z

Hi! Can you give some details about why you need these changes?

Rocketknight1 · 2021-10-18T12:31:28Z

Hey, sorry, I should have explained! I've been getting a lot of `VisibleDeprecationWarning`s from NumPy due to an issue in the formatter; see #3084. This is a temporary workaround (since I'm using these methods in the upcoming course) until I can fix that issue, because I couldn't see an obvious fix for the NumPy formatter. If you can see a quick way to fix that, though, that might be even better!

lhoestq

Ok I see :)

lhoestq · 2021-10-18T16:19:01Z

src/datasets/arrow_dataset.py

        for col in cols_to_retain:
-            if col not in self.features:
-                raise ValueError(f"Couldn't find column {col} in dataset.")
+            if col not in self.features and not (col in ("attention_mask", "labels") and collate_fn is not None):


Why hardcode some column names here? It feels hacky.

Changing the `collate_fn` function could break this, no?

Rocketknight1 · 2021-10-18

It's very hacky, yeah. I need to change this to make it work properly, but I was under time pressure to get notebooks and everything ready in time to record videos for the course.

I think a better solution would be to take a `remove_columns` list instead of `columns`, and then I wouldn't have to worry so much about new columns being added by the data collator. I assume that all of those are being kept. WDYT?
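To make the trade-off in this thread concrete, here is a minimal sketch contrasting the two approaches. It is not the actual `datasets` implementation: the function names, the `COLLATOR_COLUMNS` constant, and the plain-list stand-in for `self.features` are hypothetical, and the body of the new `if` (elided in the diff) is assumed to raise the same `ValueError` as before.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical constant: the two names special-cased in the diff.
COLLATOR_COLUMNS = ("attention_mask", "labels")


def check_columns_hardcoded(
    features: Sequence[str],
    cols_to_retain: Sequence[str],
    collate_fn: Optional[Callable] = None,
) -> None:
    """Mirror the check from the diff: a missing column is tolerated only if
    it is one of the hardcoded names and a collate function is supplied."""
    for col in cols_to_retain:
        known_collator_output = col in COLLATOR_COLUMNS and collate_fn is not None
        if col not in features and not known_collator_output:
            # Assumption: the body elided from the diff still raises this error.
            raise ValueError(f"Couldn't find column {col} in dataset.")


def columns_after_removal(
    features: Sequence[str], remove_columns: Sequence[str]
) -> List[str]:
    """Sketch of the proposed alternative: name the columns to drop, so
    anything the collator adds later never needs to be validated here."""
    missing = [col for col in remove_columns if col not in features]
    if missing:
        raise ValueError(f"Couldn't find columns {missing} in dataset.")
    dropped = set(remove_columns)
    return [col for col in features if col not in dropped]


def identity_collate(batch):
    """Placeholder for a real data collator."""
    return batch


features = ["input_ids", "text", "label"]

# "labels" is absent from the dataset but passes because it is hardcoded.
check_columns_hardcoded(features, ["input_ids", "labels"], identity_collate)

# A collator that adds any other column name trips the check.
try:
    check_columns_hardcoded(
        features, ["input_ids", "token_type_ids"], identity_collate
    )
    hardcoded_check_failed = False
except ValueError:
    hardcoded_check_failed = True

# With a removal list, only columns that exist in the dataset are named.
kept = columns_after_removal(features, ["text"])
```

The second call shows the concern raised in the review: a collate function that emits a column outside the hardcoded pair fails validation, whereas the removal-list variant only ever checks names that must already exist in the dataset.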

lhoestq

Thanks for adding these comments :)

Let's have this for now!

lhoestq reviewed Oct 18, 2021

Rocketknight1 added 3 commits October 19, 2021 19:22

Fix for columns added by the collation function

bb137a7

More special-casing around labels

a2e4ac8

Style pass

a1d21ba

Rocketknight1 force-pushed the tf_dataset_fix branch from 240a61c to a1d21ba Compare October 19, 2021 18:23

Rocketknight1 added 2 commits October 20, 2021 17:23

Tweak to handling of column names

5c30a90

Adding TODO with the roadmap

41565da

lhoestq approved these changes Oct 21, 2021

lhoestq merged commit a1c8b49 into master Oct 21, 2021

lhoestq deleted the tf_dataset_fix branch October 21, 2021 15:05
