Fixes to to_tf_dataset #3085

Merged: 5 commits into master from tf_dataset_fix on Oct 21, 2021

Rocketknight1
Member

No description provided.

@lhoestq
Member

lhoestq commented Oct 18, 2021

Hi! Can you give some details about why you need these changes?

@Rocketknight1
Member Author

Hey, sorry, I should have explained! I've been getting a lot of VisibleDeprecationWarning messages from NumPy, due to an issue in the formatter (see #3084). This is a temporary workaround (since I'm using these methods in the upcoming course) until I can fix that issue, because I couldn't see an obvious fix for the NumPy formatter. If you can see a quick way to fix that, though, that might be even better!

Member

@lhoestq lhoestq left a comment

Ok I see :)

     for col in cols_to_retain:
-        if col not in self.features:
-            raise ValueError(f"Couldn't find column {col} in dataset.")
+        if col not in self.features and not (col in ("attention_mask", "labels") and collate_fn is not None):
Member

Why hardcode some column names here? It feels hacky.

Changing the collate_fn function could break this, no?
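
To make the concern concrete, here is a minimal sketch of the check from the diff, pulled out into a standalone function. The name `validate_columns` and the plain-dict `features` are assumptions for illustration only; the real check lives inside the dataset method and is not reproduced here.

```python
def validate_columns(cols_to_retain, features, collate_fn=None):
    """Hypothetical standalone version of the column check in the diff."""
    for col in cols_to_retain:
        # The exemption depends only on the column *name* and on whether a
        # collate_fn was passed; it never looks at what the collator returns.
        exempt = col in ("attention_mask", "labels") and collate_fn is not None
        if col not in features and not exempt:
            raise ValueError(f"Couldn't find column {col} in dataset.")


features = {"input_ids": None}  # stand-in for the dataset's features


def passthrough(batch):
    # Dummy collator used only to make collate_fn non-None.
    return batch


# "labels" is on the hardcoded list, so it is accepted whenever any
# collate_fn is supplied, even one that never creates that column.
validate_columns(["input_ids", "labels"], features, collate_fn=passthrough)

# A column with any other name is rejected, even if a real collator
# would have added it to the batch.
try:
    validate_columns(["input_ids", "token_type_ids"], features, collate_fn=passthrough)
    rejected = None
except ValueError as err:
    rejected = str(err)
```

So the check can be wrong in both directions: it lets through a hardcoded name the collator may not produce, and it blocks a collator-created column whose name is not on the list.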

Member Author

@Rocketknight1 Rocketknight1 Oct 18, 2021

It's very hacky, yeah. I need to change this to make it work properly, but I was under time pressure to get notebooks and everything ready in time to record videos for the course.

I think a better solution would be to take a remove_columns list instead of columns, and then I wouldn't have to worry so much about new columns being added by the data collator - I assume that all of those are being kept. WDYT?
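
A rough sketch of what the remove_columns idea could look like, assuming the filtering is applied to the collator's output. `collate_and_filter` and `toy_collate` are hypothetical names made up for this example and are not part of the library's API.

```python
def collate_and_filter(examples, collate_fn, remove_columns=()):
    """Hypothetical helper: collate first, then drop unwanted columns.

    Everything the collator returns is kept unless it is explicitly listed
    in remove_columns, so columns created by the collator need no special
    casing.
    """
    batch = collate_fn(examples)
    return {name: values for name, values in batch.items() if name not in remove_columns}


def toy_collate(examples):
    # Toy collator (an assumption for this sketch): pads input_ids with 0
    # and adds an attention_mask column that is not in the raw examples.
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "idx": []}
    for ex in examples:
        pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [0] * pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad)
        batch["idx"].append(ex["idx"])
    return batch


examples = [{"input_ids": [5, 6, 7], "idx": 0}, {"input_ids": [8], "idx": 1}]
batch = collate_and_filter(examples, toy_collate, remove_columns=["idx"])
```

Here `attention_mask` survives without being named anywhere in the call, and only `idx` is dropped.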

Member

@lhoestq lhoestq left a comment

Thanks for adding these comments :)

Let's have this for now!

@lhoestq lhoestq merged commit a1c8b49 into master Oct 21, 2021
@lhoestq lhoestq deleted the tf_dataset_fix branch October 21, 2021 15:05
2 participants